Add T2I-Adapter model and pipeline #2555


Closed
wants to merge 5 commits

Conversation

HimariO
Contributor

@HimariO HimariO commented Mar 5, 2023

This PR implements the T2I-Adapter, related pipeline, and model sideloading mechanism discussed in #2390.

Model/Pipeline description

T2I-Adapter by @TencentARC is

... a simple and small (~70M parameters, ~300M storage space) network that can provide extra guidance to pre-trained text-to-image models while freezing the original large text-to-image models.
T2I-Adapter aligns internal knowledge in T2I models with external control signals. We can train various adapters according to different conditions, and achieve rich control and editing effects.


Usage Examples

import torch
from diffusers import StableDiffusionAdapterPipeline, Adapter

adapter = Adapter.from_pretrained("RzZ/sd-v1-4-adapter-color")
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "RzZ/sd-v1-4-adapter",
    adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

# `prompts` is a list of text prompts and `images` the matching list of
# conditioning images (e.g. 8x8 color-palette images for the color adapter)
out_images = pipe(prompts, images).images

import torch
from diffusers import StableDiffusionAdapterPipeline, Adapter, MultiAdapter

adapters = [
    Adapter.from_pretrained("RzZ/sd-v1-4-adapter-keypose"),
    Adapter.from_pretrained("RzZ/sd-v1-4-adapter-depth"),
]

pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "RzZ/sd-v1-4-adapter",
    adapter=MultiAdapter(adapters),  # compose the two adapters (assumes MultiAdapter wraps a list of Adapter instances)
    torch_dtype=torch.float16,
).to("cuda")

# `cond_image_keypose` and `cond_image_depth` are the conditioning images
# (PIL images) for the two adapters, one pair per prompt
out_images = pipe(
    ["A man walking in an office room with a nice view"],
    [[cond_image_keypose, cond_image_depth]],
).images

TODO

  • Implement Adapter model
  • Implement StableDiffusionAdapterPipeline
  • Create test cases for StableDiffusionAdapterPipeline
  • Support for multi-adapter (adapter composition)
  • Add support for the style and color adapter that the author just released yesterday
  • Create documentation for StableDiffusionAdapterPipeline
  • Refine model & pipeline docstrings
  • Clean up development scripts & apply code style fixes

Discussion

  • According to the author, some of the adapter models (the depth and style adapters) are still works in progress, although they seem to be working reasonably well. I'm not sure it is the right time to include them in this PR.
  • Due to the vastly different architecture and the dependency on CLIPFeatureExtractor of the newly released style adapter, I think it's better to exclude it from this PR for now, so we don't keep expanding the scope of this PR.
    @sayakpaul @wfng92

@patrickvonplaten
Contributor

@williamberman has probably the best XP to help here :-)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@HimariO HimariO changed the title [WIP] Add T2I-Adapter model and pipeline Add T2I-Adapter model and pipeline Mar 9, 2023
@HimariO
Contributor Author

HimariO commented Mar 9, 2023

@williamberman @patrickvonplaten @sayakpaul I think this PR is ready for review. 🙌

@patrickvonplaten
Contributor

Hey @HimariO,

Sorry that we're a bit slow here. I'll try to review today (or Monday at the latest).

@williamberman
Contributor

williamberman commented Mar 16, 2023

hey @HimariO thank you so much for your work on this :)

My initial reaction is that the sideloading mechanism adds a bit too much "magic". Could we have a description of why we need to add a mixin like this? From my initial reading here, my understanding was that T2I follows the same pattern as ControlNet, which should just require passing the inputs to the UNet's forward.

@HimariO
Contributor Author

HimariO commented Mar 16, 2023

Hi @williamberman, although Adapter and ControlNet have a lot of similarities, there are some differences in their feature-fusing schemes. I've included two diagrams that illustrate these differences below. The rectangles and trapezoids that are paired with the same color represent the CrossAttnDownBlock2D or CrossAttnUpBlock2D:

[diagram: adapter_sideload-controlnet.drawio]
[diagram: adapter_sideload-adapter.drawio]

The main differences are:

  1. Adapter needs to fuse its control signal with hidden states from the layer inside the CrossAttnDownBlock2D (one layer deeper into the UNet).
  2. The fused hidden state is then passed on to the following downsample blocks, rather than to the upsample blocks through residual connections as in ControlNet.

As we previously discussed in #2331, I too believe that more research like ControlNet will be happening in the near future. Therefore, providing a convenient way to experiment/integrate new ideas like controlling different modules or different fusing methods through a sideload approach could be valuable and more scalable compared to the straightforward method of passing the control signal in a top-down fashion.
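To make the contrast concrete, here is a toy sketch of the two fusion points. It is illustrative only: plain convolutions stand in for the UNet blocks, and none of the names or signatures are the actual diffusers API.

import torch
import torch.nn as nn

# Two "down blocks" and two "up blocks" built from plain convolutions,
# just to show where each method fuses its control features.
down_blocks = nn.ModuleList([nn.Conv2d(4, 4, 3, padding=1) for _ in range(2)])
up_blocks = nn.ModuleList([nn.Conv2d(8, 4, 3, padding=1) for _ in range(2)])

def forward_controlnet_style(sample, control_residuals):
    # ControlNet: residuals are added to the skip connections that the up
    # blocks consume; the down blocks themselves never see the control signal.
    skips = []
    for block, res in zip(down_blocks, control_residuals):
        sample = block(sample)
        skips.append(sample + res)
    for block in up_blocks:
        sample = block(torch.cat([sample, skips.pop()], dim=1))
    return sample

def forward_adapter_style(sample, adapter_features):
    # T2I-Adapter: the feature map is added to the hidden states inside the
    # down path, so the *fused* states are what the following down blocks
    # (not the up blocks) receive.
    skips = []
    for block, feat in zip(down_blocks, adapter_features):
        sample = block(sample) + feat
        skips.append(sample)
    for block in up_blocks:
        sample = block(torch.cat([sample, skips.pop()], dim=1))
    return sample

x = torch.randn(1, 4, 16, 16)
control = [torch.randn(1, 4, 16, 16) for _ in range(2)]
print(forward_controlnet_style(x, control).shape)  # torch.Size([1, 4, 16, 16])
print(forward_adapter_style(x, control).shape)     # torch.Size([1, 4, 16, 16])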

Let me know if you have any questions or concerns!

@patrickvonplaten
Contributor

Hey @HimariO,

Thanks a lot for your design, that is super useful! I'm with @williamberman here - I think we should try to simplify the code and make the design more similar to ControlNet.

I've made a quick design proposal PR here: #2708 - would this design choice be ok for you?

We don't want to add a whole new design pattern for a new SD controlling algorithm. Instead, we should try to make the fewest possible changes given what we already have - so we should try to closely adapt this PR to the ControlNet one.

@HimariO could you maybe try to adapt your PR to conform a bit more to the design in #2708. I think this could work nicely, no?

@patrickvonplaten
Contributor

Feel free to copy anything you need from #2708 - it's just there as a design proposal for you

@HimariO
Contributor Author

HimariO commented Mar 16, 2023

@patrickvonplaten Thanks for the proposal, I will look into it later, probably after I finish reading the source code of the new CoAdapter that @TencentARC released not too long ago.

@williamberman
Contributor

Hi @williamberman, although Adapter and ControlNet have a lot of similarities, there are some differences in their feature-fusing schemes. I've included two diagrams that illustrate these differences below. The rectangles and trapezoids that are paired with the same color represent the CrossAttnDownBlock2D or CrossAttnUpBlock2D:

[diagrams: adapter_sideload-controlnet.drawio, adapter_sideload-adapter.drawio]

The main differences are:

  1. Adapter needs to fuse its control signal with hidden states from the layer inside the CrossAttnDownBlock2D (one layer deeper into the UNet).
  2. The fused hidden state is then passed on to the following downsample blocks, rather than to the upsample blocks through residual connections as in ControlNet.

As we previously discussed in #2331, I too believe that more research like ControlNet will be happening in the near future. Therefore, providing a convenient way to experiment/integrate new ideas like controlling different modules or different fusing methods through a sideload approach could be valuable and more scalable compared to the straightforward method of passing the control signal in a top-down fashion.

Let me know if you have any questions or concerns!

Nice! Thanks for the diagram, super helpful :) Looks like it can still be handled by directly passing the values through the forward methods. I think Patrick's proposal makes a lot of sense.

@HimariO
Contributor Author

HimariO commented Mar 17, 2023

@patrickvonplaten's proposal is indeed very clean and keeps the change to a minimum. One use case #2708 may have some trouble handling is using Adapter and ControlNet at the same time (e.g. color adapter + canny ControlNet). But I think we can just stick with #2708 for now(?).

@patrickvonplaten
Contributor

@patrickvonplaten's proposal is indeed very clean and keeps the change to a minimum. One use case #2708 may have some trouble handling is using Adapter and ControlNet at the same time (e.g. color adapter + canny ControlNet). But I think we can just stick with #2708 for now(?).

Ah I see. Yes, could we maybe try to stick to #2708 for now and see in a follow-up how we could adapt things?
Let me know if you need any help :-)

@HimariO
Contributor Author

HimariO commented Mar 17, 2023

@patrickvonplaten @williamberman this PR is updated with #2708.

from .resnet import Downsample2D


class ResnetBlock(nn.Module):
Contributor

Would it be possible to use our existing ResnetBlock2D? It looks very similar

Contributor Author

I did think of that before, but the existence of ResnetBlock2D.norm1, ResnetBlock2D.norm2, and the timestep embedding argument in the forward method prevented me from using it to implement the adapter.

Contributor

@williamberman williamberman Mar 17, 2023

The time embedding is optional; feel free to make a small change to the block definition where time_embedding_norm can take the value "no_norm", which when set will set norm1 and norm2 to None so they're skipped in the forward :)
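A minimal sketch of that idea, using a simplified stand-in rather than the real ResnetBlock2D (attribute names follow the comment above; everything else is assumed purely for illustration):

import torch
import torch.nn as nn

class SimplifiedResnetBlock2D(nn.Module):
    # Stand-in only: when time_embedding_norm="no_norm", norm1/norm2 are set
    # to None and simply skipped in forward, so the block can be reused
    # without group norms or a timestep embedding.
    def __init__(self, in_channels, out_channels, time_embedding_norm="default"):
        super().__init__()
        use_norm = time_embedding_norm != "no_norm"
        self.norm1 = nn.GroupNorm(32, in_channels) if use_norm else None
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.norm2 = nn.GroupNorm(32, out_channels) if use_norm else None
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.act = nn.SiLU()
        self.skip = nn.Conv2d(in_channels, out_channels, 1) if in_channels != out_channels else nn.Identity()

    def forward(self, x, temb=None):
        h = x
        if self.norm1 is not None:
            h = self.norm1(h)
        h = self.conv1(self.act(h))
        if self.norm2 is not None:
            h = self.norm2(h)
        h = self.conv2(self.act(h))
        return h + self.skip(x)

block = SimplifiedResnetBlock2D(64, 64, time_embedding_norm="no_norm")
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])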

Contributor

@HimariO any progress on replacing the resnet block? Happy to help if needed.

Contributor Author

I later spotted some more differences:

  1. Adapter sometimes uses kernel size 1 in the second conv2d module of the ResNet block, so we will need to add one (or two, since there are two conv2d modules in the ResNet block) more parameters to ResnetBlock2D.
  2. Adapter uses the "conv2d -> activation -> conv2d" pattern, while ResnetBlock2D uses the "activation -> conv2d -> activation -> conv2d" pattern. I'm not sure there is a good way to bypass ResnetBlock2D's first activation function.

I'm not sure whether it is a good idea to make those changes to ResnetBlock2D; some help/suggestions here would be great (see the sketch below) :)
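For reference, a minimal sketch of the block pattern the adapter needs, mirroring the two differences above (the class and parameter names are made up for illustration and are not the PR's actual code):

import torch
import torch.nn as nn

class AdapterStyleResnetBlock(nn.Module):
    # Illustrates the two differences listed above:
    #   1. the second conv may use kernel size 1 instead of 3
    #   2. the pattern is conv -> activation -> conv, with no activation or
    #      norm before the first conv
    def __init__(self, channels, second_kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU()
        self.conv2 = nn.Conv2d(channels, channels, second_kernel_size, padding=second_kernel_size // 2)

    def forward(self, x):
        return x + self.conv2(self.act(self.conv1(x)))

block = AdapterStyleResnetBlock(64, second_kernel_size=1)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])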

Contributor

I'll try to look into it tomorrow! I think it'd be totally fine though to slightly adapt the existing ResnetBlock.

Comment on lines +110 to +118
| Model Name | Control Image Overview| Control Image Example | Generated Image Example |
|---|---|---|---|
|[RzZ/sd-v1.4-adapter-color](https://huggingface.co/RzZ/sd-v1-4-adapter-color/)<br/> *Trained with spatial color palette* | An image with an 8x8 color palette.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-canny](https://huggingface.co/RzZ/sd-v1-4-adapter-canny)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-sketch](https://huggingface.co/RzZ/sd-v1-4-adapter-sketch)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-depth](https://huggingface.co/RzZ/sd-v1-4-adapter-depth)<br/> *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-openpose](https://huggingface.co/RzZ/sd-v1-4-adapter-openpose)<br/> *Trained with OpenPose bone image* | An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-keypose](https://huggingface.co/RzZ/sd-v1-4-adapter-keypose)<br/> *Trained with mmpose skeleton image* | A [mmpose skeleton](https://github.com/open-mmlab/mmpose) image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-seg](https://huggingface.co/RzZ/sd-v1-4-adapter-seg)<br/>*Trained with semantic segmentation* | A [custom](https://github.com/TencentARC/T2I-Adapter/discussions/25) segmentation protocol image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_output.png"/></a> |
Contributor

TODO - upload weights to https://huggingface.co/TencentARC

Comment on lines 44 to 60
>>> image = load_image("https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/color_ref.png")

>>> color_palette = image.resize((8, 8))
>>> color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)

>>> import torch
>>> from diffusers import StableDiffusionAdapterPipeline, Adapter

>>> adapter = Adapter.from_pretrained("RzZ/sd-v1-4-adapter-color")
>>> pipe = StableDiffusionAdapterPipeline.from_pretrained(
... "RzZ/sd-v1-4-adapter",
... adapter=adapter,
... torch_dtype=torch.float16,
... )
Contributor

TODO update hub repos

else:
self.out_conv = None

self.block1 = nn.Conv2d(mid_c, mid_c, 3, 1, 1)
Contributor

This line and the two following lines are different from what we currently have in the UNet, which is why it can be called a "BottleNeck". Nevertheless, the rest is exactly the same as far as I can see.

Load adapter module with from_pretrained

Prototyping generalized adapter framework

Write up doc string for sideload framework (WIP) + some minor update on implementation

Update adapter models

Remove old adapter optional args in UNet

Add StableDiffusionAdapterPipeline unit test

Handle cpu offload in StableDiffusionAdapterPipeline

Auto correct coding style

Update model repo name to "RzZ/sd-v1-4-adapter-pipeline"

Refactor MultiAdapter to be better compatible with config system

Export MultiAdapter

Create pipeline document template from controlnet

Create dummy objects

Supporting new AdapterLight model

Fix StableDiffusionAdapterPipeline common pipeline test

[WIP] Update adapter pipeline document

Handle num_inference_steps in StableDiffusionAdapterPipeline

Update definition of Adapter "channels_in"

Update documents

Apply code style

Fix doc typo and merge error

Update doc string and example

Quality of life improvement

Remove redundant code and file from prototyping

Remove unused package

Remove comments

Fix title

Fix typo

Add conditioning scale arg

Bring back old implementation

Offload sideload

Add supply info on document

Update src/diffusers/models/adapter.py

Co-authored-by: Will Berman <[email protected]>

Update MultiAdapter constructor

Swap out custom checkpoint and update pipeline constructor

Update document

Apply suggestions from code review

Co-authored-by: Will Berman <[email protected]>

Correcting style

Following single-file policy

Update auto size in image preprocess func

Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py

Co-authored-by: Will Berman <[email protected]>

fix copies

Update adapter pipeline behavior

Add adapter_conditioning_scale doc string

Add the missing doc string

Apply suggestions from code review

Co-authored-by: Patrick von Platen <[email protected]>

Fix few bugs from suggestion

Handle L-mode PIL image as control image

Rename to differentiate adapter resblock

Update src/diffusers/models/adapter.py

Co-authored-by: Sayak Paul <[email protected]>

Fix typo

Update adapter parameter name

Update test case and code style

Fix copies

Fix typo

Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py

Co-authored-by: Will Berman <[email protected]>

Update Adapter class name

Add checkpoint converting script

Fix style

Fix-copies
@patrickvonplaten
Contributor

cc @williamberman it'd be great if you could try to unblock this PR by helping on the resnet refactor :-)

@sayakpaul sayakpaul mentioned this pull request Apr 28, 2023
@github-actions
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label May 22, 2023
@williamberman
Contributor

Sorry for the delay here @HimariO, I've had a few too many things on my plate and haven't been able to get to the resnet refactor :)

@HimariO
Contributor Author

HimariO commented May 25, 2023

No worries @williamberman, take your time. Let me know if you need help or if you have an idea on how to approach the refactor. In the meantime, I'll try to keep this pull request up to date with the main branch.

@adhikjoshi

adhikjoshi commented Jun 4, 2023

I think having one pipeline for ControlNet and T2I makes sense here; both are identical and can be used the same way. With the current rate of new AI projects coming out, I think interoperability is what we should go for.

Or something like "plugin pipelines" is what will be needed, where instead of updating diffusers, anyone can make a pipeline which can be loaded as a plugin. I think extending the way community pipelines are loaded will help with this.

pipe = DiffusionPipeline.from_pretrained("stablediffusionapi/edge-of-realism").to("cuda")
pipe.load_custom_pipeline("t2i")

Right now, a new diffusers version upgrade is needed for any new pipeline we want to use. Then, just side-loading the pipeline will do the trick. It would also remove a lot of overhead from the diffusers team in supporting a never-ending list of projects in diffusers.

Then the diffusers team can launch new pipelines instead of new diffusers releases. There can be official pipelines and community ones.

@williamberman @patrickvonplaten
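(For context: diffusers already exposes a related mechanism, since community pipelines can be loaded by name via the custom_pipeline argument of DiffusionPipeline.from_pretrained. A rough sketch follows; the repo id and pipeline name are just common examples and are not part of this PR.)

import torch
from diffusers import DiffusionPipeline

# Loads an existing community pipeline by name at runtime; "lpw_stable_diffusion"
# is used purely as an example of the current community-pipeline mechanism.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="lpw_stable_diffusion",
    torch_dtype=torch.float16,
).to("cuda")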

@patrickvonplaten
Contributor

I think having one pipeline for ControlNet and T2I makes sense here; both are identical and can be used the same way. With the current rate of new AI projects coming out, I think interoperability is what we should go for.

Or something like "plugin pipelines" is what will be needed, where instead of updating diffusers, anyone can make a pipeline which can be loaded as a plugin. I think extending the way community pipelines are loaded will help with this.

pipe = DiffusionPipeline.from_pretrained("stablediffusionapi/edge-of-realism").to("cuda")
pipe.load_custom_pipeline("t2i")

Right now, a new diffusers version upgrade is needed for any new pipeline we want to use. Then, just side-loading the pipeline will do the trick. It would also remove a lot of overhead from the diffusers team in supporting a never-ending list of projects in diffusers.

Then the diffusers team can launch new pipelines instead of new diffusers releases. There can be official pipelines and community ones.

@williamberman @patrickvonplaten

We don't want to entangle different concepts here. ControlNet and T2I should be different pipelines. @williamberman if you're too busy this week, I can pick up this issue here

@@ -872,6 +878,9 @@ def custom_forward(*inputs):

output_states += (hidden_states,)

if additional_residuals is not None:
hidden_states += additional_residuals
Contributor

@bonlime bonlime Jul 3, 2023

@HimariO
The implementation above is not correct. You also need to update output_states with the new hidden_states, so the code should look like this to match the original implementation. Without it, the results do not match the original version:

if additional_residuals is not None:
    hidden_states += additional_residuals
    # also replace the last recorded output so downstream blocks see the fused states
    output_states = output_states[:-1] + (hidden_states,)

@williamberman
Contributor

williamberman commented Jul 3, 2023

Hey @HimariO, sorry for the repeated delays here. I was taking a look today at getting this running on my machine and I couldn't get the conversion script to work for the canny model. Could you add some examples of how to convert the original models?

Specifically, the conversion script will write out the state dict, but the state dict isn't compatible with the model config in your repos.

@williamberman williamberman mentioned this pull request Jul 4, 2023
@sayakpaul
Member

Okay to close this in favor of #3932, no?

@github-actions
Contributor

github-actions bot commented Aug 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Labels
stale Issues that haven't received updates

10 participants