Add T2I-Adapter model and pipeline #2555
Conversation
@williamberman has probably the best XP to help here :-)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
@williamberman @patrickvonplaten @sayakpaul I think this PR is ready for review. 🙌
Hey @HimariO, sorry that we're a bit slow here. I'll try to review today (or Monday at the latest).
hey @HimariO thank you so much for your work on this :) My initial reaction is that the sideloading mechanism adds a bit too much "magic". Could we have a description of why we need to add a mixin like this? From my initial reading here, my understanding was that t2i follows the same pattern as controlnet, which should just require passing the inputs to the unet's forward.
Hi @williamberman, although Adapter and ControlNet have a lot of similarities, there are some differences in their feature-fusing schema. I've included two diagrams that illustrate these differences below; the rectangles and trapezoids paired with the same color represent the corresponding fusion points in each design. The main difference is where the control features are fused into the UNet: ControlNet adds its residuals to the down-block outputs (the skip connections) and the mid block, while the Adapter adds its feature maps to the hidden states inside the encoder's down blocks.
As we previously discussed in #2331, I too believe that more ControlNet-like research will be happening in the near future. Therefore, providing a convenient way to experiment with and integrate new ideas, such as controlling different modules or using different fusing methods, through a sideload approach could be valuable and more scalable than the straightforward method of passing the control signal in a top-down fashion. Let me know if you have any questions or concerns!
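To make the contrast concrete, here is a minimal, self-contained sketch of the two fusion points; all tensor names, shapes, and the single-block framing are illustrative assumptions, not the actual diffusers API:

```python
import torch

# Stand-ins for one UNet encoder stage (shapes are assumptions for illustration only).
hidden_states = torch.randn(1, 320, 64, 64)   # activations inside a down block
down_block_res_samples = []                    # skip connections collected for the decoder

# ControlNet-style fusion: the control network produces one residual per down block
# (plus one for the mid block), and each residual is added to the block's *output*,
# i.e. to the skip connection handed to the decoder; the down block itself runs unchanged.
controlnet_residual = torch.randn(1, 320, 64, 64)
block_output = hidden_states                   # pretend the block has finished its computation
down_block_res_samples.append(block_output + controlnet_residual)

# Adapter-style fusion: the adapter runs once on the control image (no latents, no timestep)
# and yields one feature map per resolution; each map is added to the hidden states *inside*
# the matching down block, so the remainder of that block and all later blocks see it.
adapter_feature = torch.randn(1, 320, 64, 64)
hidden_states = hidden_states + adapter_feature
```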
Hey @HimariO, thanks a lot for your design, that is super useful! I'm with @williamberman here: I think we should try to simplify the code and make the design more similar to controlnet. I've made a quick design proposal PR here: #2708 - would this design choice be ok for you? We don't want to add a whole new design pattern for a new SD controlling algorithm. Instead we should try to make the fewest possible changes given what we already have, so we should try to strongly adapt this PR to the ControlNet one. @HimariO could you maybe try to adapt your PR to conform a bit more to the design in #2708? I think this could work nicely, no?
Feel free to copy anything you need from #2708 - it's just there as a design proposal for you.
@patrickvonplaten Thanks for the proposal, I will look into it later, probably after I finish reading the source code of the new CoAdapter @TencentARC released not too long ago.
Nice! Thanks for the diagram, super helpful :) Looks like this can still be handled by directly passing the values through the forward methods. I think patrick's proposal makes a lot of sense.
@patrickvonplaten's proposal is indeed very clean and keeps the change to a minimum. One use case #2708 may have some trouble handling is using Adapter and ControlNet at the same time (e.g. color-adapter + canny-controlnet). But I think we can just stick with #2708 for now(?).
Ah I see, yes could we maybe try to stick to #2708 for now and see in a follow-up how we could adapt things?
@patrickvonplaten @williamberman this PR has been updated to follow #2708.
src/diffusers/models/adapter.py
from .resnet import Downsample2D


class ResnetBlock(nn.Module):
Would it be possible to use our existing ResnetBlock2D? It looks very similar.
I did think of that before, but the existence of ResnetBlock2D.norm1, ResnetBlock2D.norm2, and the time step embedding argument in the forward method prevented me from using it to implement the adapter.
The time embedding is optional; feel free to make a small change to the block definition where time_embedding_norm can take the value "no_norm", which when set will set norm1 and norm2 to None so they're skipped in the forward :)
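For illustration, here is a rough, trimmed-down sketch of what that suggestion could look like; this is a stand-in for ResnetBlock2D rather than the actual diffusers class, and the constructor arguments shown are assumptions (shortcut conv, dropout, and the time-embedding projection are omitted):

```python
import torch
import torch.nn as nn


class SketchResnetBlock2D(nn.Module):
    """Simplified stand-in; assumes in_channels == out_channels."""

    def __init__(self, channels, groups=32, eps=1e-6, time_embedding_norm="default"):
        super().__init__()
        if time_embedding_norm == "no_norm":
            # proposed escape hatch: disable the group norms entirely
            self.norm1 = None
            self.norm2 = None
        else:
            self.norm1 = nn.GroupNorm(groups, channels, eps=eps)
            self.norm2 = nn.GroupNorm(groups, channels, eps=eps)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.nonlinearity = nn.SiLU()

    def forward(self, x, temb=None):
        h = x
        if self.norm1 is not None:
            h = self.norm1(h)
        h = self.conv1(self.nonlinearity(h))
        if self.norm2 is not None:
            h = self.norm2(h)
        h = self.conv2(self.nonlinearity(h))
        return x + h


print(SketchResnetBlock2D(320, time_embedding_norm="no_norm")(torch.randn(1, 320, 8, 8)).shape)
```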
@HimariO any progress on replacing the resnet block? happy to help if needed
I later spotted some more differences:
- Adapter sometimes uses kernel size 1 in the second conv2d module of the ResNet block, so we will need to add one (or two, since there are two conv2d in the ResNet block) more parameters to ResnetBlock2D.
- Adapter uses the "conv2d -> activation -> conv2d" pattern, while ResnetBlock2D uses the "activation -> conv2d -> activation -> conv2d" pattern. Not sure there is any good way to bypass ResnetBlock2D's first activation function.

Not sure whether it is a good idea to make those changes to ResnetBlock2D or not; some help/suggestions here would be great :) (A minimal sketch of the adapter block pattern follows below.)
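As a reference point, here is a minimal sketch of the adapter-style block pattern described above; the class and argument names are illustrative rather than the PR's actual code:

```python
import torch
import torch.nn as nn


class AdapterBlockSketch(nn.Module):
    def __init__(self, channels, ksize=3):
        super().__init__()
        # conv -> activation -> conv, with no normalization and no timestep embedding;
        # ksize=1 mimics the 1x1 second conv that some adapter blocks use.
        self.block1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU()
        self.block2 = nn.Conv2d(channels, channels, ksize, padding=ksize // 2)

    def forward(self, x):
        return x + self.block2(self.act(self.block1(x)))


x = torch.randn(1, 320, 64, 64)
print(AdapterBlockSketch(320, ksize=1)(x).shape)  # torch.Size([1, 320, 64, 64])
```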
I'll try to look into it tomorrow! I think it'd be totally fine though to slightly adapt the existing ResnetBlock.
| Model Name | Control Image Overview | Control Image Example | Generated Image Example |
|---|---|---|---|
|[RzZ/sd-v1.4-adapter-color](https://huggingface.co/RzZ/sd-v1-4-adapter-color/)<br/> *Trained with spatial color palette* | An image with an 8x8 color palette.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-canny](https://huggingface.co/RzZ/sd-v1-4-adapter-canny)<br/> *Trained with canny edge detection* | A monochrome image with white edges on a black background.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-canny/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-sketch](https://huggingface.co/RzZ/sd-v1-4-adapter-sketch)<br/> *Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_input.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-sketch/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-depth](https://huggingface.co/RzZ/sd-v1-4-adapter-depth)<br/> *Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-depth/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-openpose](https://huggingface.co/RzZ/sd-v1-4-adapter-openpose)<br/> *Trained with OpenPose bone image* | An [OpenPose bone](https://github.com/CMU-Perceptual-Computing-Lab/openpose) image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-openpose/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-keypose](https://huggingface.co/RzZ/sd-v1-4-adapter-keypose)<br/> *Trained with mmpose skeleton image* | An [mmpose skeleton](https://github.com/open-mmlab/mmpose) image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-keypose/resolve/main/sample_output.png"/></a>|
|[RzZ/sd-v1.4-adapter-seg](https://huggingface.co/RzZ/sd-v1-4-adapter-seg)<br/>*Trained with semantic segmentation* | A [custom](https://github.com/TencentARC/T2I-Adapter/discussions/25) segmentation protocol image.|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_input.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_input.png"/></a>|<a href="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_output.png"><img width="64" src="https://huggingface.co/RzZ/sd-v1-4-adapter-seg/resolve/main/sample_output.png"/></a>|
TODO - upload weights to https://huggingface.co/TencentARC
>>> import torch
>>> from PIL import Image
>>> from diffusers import StableDiffusionAdapterPipeline, Adapter
>>> from diffusers.utils import load_image

>>> image = load_image("https://huggingface.co/RzZ/sd-v1-4-adapter-color/resolve/main/color_ref.png")

>>> # downscale to an 8x8 color palette, then upscale back with nearest-neighbour resampling
>>> color_palette = image.resize((8, 8))
>>> color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)

>>> adapter = Adapter.from_pretrained("RzZ/sd-v1-4-adapter-color")
>>> pipe = StableDiffusionAdapterPipeline.from_pretrained(
...     "RzZ/sd-v1-4-adapter",
...     adapter=adapter,
...     torch_dtype=torch.float16,
... )
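A hedged continuation of the example showing how the pipeline might then be called; the positional prompt, the `image` keyword for the control image, and the output filename are assumptions for illustration, not necessarily the PR's final API:

```python
>>> pipe = pipe.to("cuda")
>>> generator = torch.manual_seed(0)  # fixed seed for reproducibility
>>> result = pipe(
...     "a photo of a house in the mountains",
...     image=color_palette,  # assumed keyword for the control image
...     num_inference_steps=20,
...     generator=generator,
... ).images[0]
>>> result.save("house_from_color_palette.png")
```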
TODO update hub repos
        else:
            self.out_conv = None

        self.block1 = nn.Conv2d(mid_c, mid_c, 3, 1, 1)
This line and the two following lines are different from what we have currently in the UNet, and are why it can be called a "bottleneck". Nevertheless the rest is exactly the same as far as I can see.
Force-pushed the branch from eae1c2b to 675f0d1 with the following commits (Co-authored-by: Patrick von Platen <[email protected]>):
- Load adapter module with from_pretrained
- Prototyping generalized adapter framework
- Writeup doc string for sideload framework(WIP) + some minor update on implementation
- Update adapter models
- Remove old adapter optional args in UNet
- Add StableDiffusionAdapterPipeline unit test
- Handle cpu offload in StableDiffusionAdapterPipeline
- Auto correct coding style
- Update model repo name to "RzZ/sd-v1-4-adapter-pipeline"
- Refactor MultiAdapter to better compatible with config system
- Export MultiAdapter
- Create pipeline document template from controlnet
- Create dummy objects
- Supproting new AdapterLight model
- Fix StableDiffusionAdapterPipeline common pipeline test
- [WIP] Update adapter pipeline document
- Handle num_inference_steps in StableDiffusionAdapterPipeline
- Update definition of Adapter "channels_in"
- Update documents
- Apply code style
- Fix doc typo and merge error
- Update doc string and example
- Quality of life improvement
- Remove redundant code and file from prototyping
- Remove unused pageage
- Remove comments
- Fix title
- Fix typo
- Add conditioning scale arg
- Bring back old implmentation
- Offload sideload
- Add supply info on document
- Update src/diffusers/models/adapter.py (Co-authored-by: Will Berman <[email protected]>)
- Update MultiAdapter constructor
- Swap out custom checkpoint and update pipeline constructor
- Update docment
- Apply suggestions from code review (Co-authored-by: Will Berman <[email protected]>)
- Correcting style
- Following single-file policy
- Update auto size in image preprocess func
- Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py (Co-authored-by: Will Berman <[email protected]>)
- fix copies
- Update adapter pipeline behavior
- Add adapter_conditioning_scale doc string
- Add the missing doc string
- Apply suggestions from code review (Co-authored-by: Patrick von Platen <[email protected]>)
- Fix few bugs from suggestion
- Handle L-mode PIL image as control image
- Rename to differentiate adapter resblock
- Update src/diffusers/models/adapter.py (Co-authored-by: Sayak Paul <[email protected]>)
- Fix typo
- Update adapter parameter name
- Update test case and code style
- Fix copies
- Fix typo
- Update src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_adapter.py (Co-authored-by: Will Berman <[email protected]>)
- Update Adapter class name
- Add checkpoint converting script
- Fix style
- Fix-copies
cc @williamberman it'd be great if you could try to unblock this PR by helping on the resnet refactor :-)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
sorry for the delay here @HimariO I've had a few too many things on my plate and haven't been able to get to the resnet refactor :)
No worries @williamberman, take your time. Let me know if you need help or if you have an idea on how to approach the refactor. In the meantime, I'll try to keep this pull request up to date with the main branch.
I think having one pipeline for ControlNet and T2I makes sense here; both work essentially the same way and can be used the same way. With the current rate of new AI projects appearing, I think interoperability is what we should go for. Or something like "plugin pipelines" will be needed, where instead of updating diffusers, anyone can make a pipeline that can be loaded as a plugin; I think extending the way community pipelines are loaded will help with this.
Right now, a new diffusers version is needed for any new pipeline we want to use. With plugin loading, simply sideloading the pipeline would do the trick. It would also remove a lot of overhead from the diffusers team in supporting a never-ending list of projects; the team could release new pipelines instead of new diffusers versions. There could be official pipelines and community ones.
We don't want to entangle different concepts here. ControlNet and T2I should be different pipelines. @williamberman if you're too busy this week, I can pick up this issue here.
@@ -872,6 +878,9 @@ def custom_forward(*inputs):

    output_states += (hidden_states,)

    if additional_residuals is not None:
        hidden_states += additional_residuals
@HimariO The implementation above is not correct; you also need to update output_states with the new hidden_states, so the code should look like this to match the original implementation. Without it the results do not match the original version:

if additional_residuals is not None:
    hidden_states += additional_residuals
    # also replace the last collected output so the skip connection carries the adapter features
    output_states = output_states[:-1] + (hidden_states,)
hey @HimariO sorry for the repeated delays here. I was taking a look today at getting this running on my machine and I couldn't get the conversion script to work for the canny model. Could you add some examples of how to convert the original models? Specifically, the conversion script will write out the state dict, but the state dict isn't compatible with the model config in your repos.
Okay to close this in favor of #3932 no?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This PR implements the T2I-Adapter, related pipeline, and model sideloading mechanism discussed in #2390.
Model/Pipeline description
T2I-Adapter by @TencentARC is a lightweight adapter network that extracts features from a control image (e.g. color palette, canny edges, sketch, depth, pose, or segmentation map) and adds them to the intermediate features of the Stable Diffusion UNet encoder, providing ControlNet-like controllability with a much smaller auxiliary model.
Usage Examples
TODO
- style and color adapter that the author just released yesterday

Discussion
According to the author, some of the adapter models are still work in progress (the depth and style adapters), although those adapters seem to be working reasonably well. Not sure it is the right time to include them in this PR.
Given the CLIPFeatureExtractor dependency of the newly released style adapter, I think it's better to exclude it from this PR for now, so we don't keep expanding the scope of this PR.
@sayakpaul @wfng92