Add UniDiffuser model and pipeline #2963
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# UniDiffuser

The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://arxiv.org/abs/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.

The abstract of the [paper](https://arxiv.org/abs/2303.06555) is the following:

*This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).*

Resources:

* [Paper](https://arxiv.org/abs/2303.06555).
* [Original Code](https://github.com/thu-ml/unidiffuser).

Available Checkpoints are:
- *UniDiffuser-v0 (512x512 resolution)* [dg845/unidiffuser-diffusers-v0](https://huggingface.co/dg845/unidiffuser-diffusers-v0)
- *UniDiffuser-v1 (512x512 resolution)* [dg845/unidiffuser-diffusers-v1](https://huggingface.co/dg845/unidiffuser-diffusers)

## Available Pipelines:

| Pipeline | Tasks | Demo |
|---|---|:---:|
| [UniDiffuserPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/pipeline_unidiffuser.py) | *Joint Image-Text Gen*, *Text-to-Image*, *Image-to-Text*, *Image Gen*, *Text Gen*, *Image Variation*, *Text Variation* | |

## Usage Examples

Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it is capable of performing a diverse range of generation tasks.

### Unconditional Image and Text Generation

Unconditional generation (where we start from only latents sampled from a standard Gaussian prior) from a `UniDiffuserPipeline` will produce an (image, text) pair:

```python
import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "dg845/unidiffuser-diffusers"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)
```

This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution.

Note that the generation task is inferred from the inputs used when calling the pipeline.
It is also possible to specify the unconditional generation task ("mode") manually with [`UniDiffuserPipeline.set_joint_mode`]:

```python
# Equivalent to the above.
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
```

When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting to infer the mode.
You can reset the mode with [`UniDiffuserPipeline.reset_mode`], after which the pipeline will once again infer the mode.
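
For instance, a minimal sketch of resetting the mode, reusing the `pipe` from above:

```python
# After reset_mode(), the task is again inferred from the call arguments;
# with no image or prompt supplied, this call performs joint generation.
pipe.reset_mode()
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image, text = sample.images[0], sample.text[0]
```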

You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation, since we sample from the marginal distribution of images and text, respectively):

```python
# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance
# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]
# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]
```

### Text-to-Image Generation

UniDiffuser is also capable of sampling from conditional distributions; that is, the distribution of images conditioned on a text prompt or the distribution of texts conditioned on an image.
Here is an example of sampling from the conditional image distribution (text-to-image generation or text-conditioned image generation):

```python
import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "dg845/unidiffuser-diffusers"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")
```

The `text2img` mode requires that either an input `prompt` or `prompt_embeds` be supplied. You can set the `text2img` mode manually with [`UniDiffuserPipeline.set_text_to_image_mode`].

### Image-to-Text Generation

Similarly, UniDiffuser can also produce text samples given an image (image-to-text or image-conditioned text generation):

```python
import requests
import torch
from PIL import Image
from io import BytesIO

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "dg845/unidiffuser-diffusers"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
response = requests.get(image_url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)
```

The `img2text` mode requires that an input `image` be supplied. You can set the `img2text` mode manually with [`UniDiffuserPipeline.set_image_to_text_mode`].
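
A corresponding sketch for pinning `img2text` mode, reusing `pipe` and `init_image` from above:

```python
# Pin the pipeline to image-to-text mode; an input image is still required.
pipe.set_image_to_text_mode()
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
caption = sample.text[0]
```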

### Image Variation

The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the output of the first generation.
This produces a new image which is semantically similar to the input image:

```python
import requests
import torch
from PIL import Image
from io import BytesIO

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "dg845/unidiffuser-diffusers"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
response = requests.get(image_url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")
```
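
Since the round trip involves two stochastic sampling steps, you may want to seed it for reproducibility. A sketch, assuming `UniDiffuserPipeline` accepts the standard diffusers `generator` argument:

```python
# Seed both generation steps of the round trip (the `generator` argument is
# assumed to follow the usual diffusers pipeline convention).
generator = torch.Generator(device=device).manual_seed(0)
sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0, generator=generator)
sample = pipe(prompt=sample.text[0], num_inference_steps=20, guidance_scale=8.0, generator=generator)
final_image = sample.images[0]
```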

### Text Variation

Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation:

```python
import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "dg845/unidiffuser-diffusers"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)
```
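
Each call above returns an [`ImageTextPipelineOutput`] (documented below), whose `images` field is a list of PIL images and whose `text` field is a list of strings; depending on the mode, one of the two fields may be `None`. A sketch of requesting several samples at once, assuming the pipeline exposes a `num_images_per_prompt` argument in the usual diffusers style:

```python
# Request several image samples for a single prompt (num_images_per_prompt
# is an assumption based on the usual diffusers pipeline API).
sample = pipe(prompt="an elephant under the sea", num_inference_steps=20,
              guidance_scale=8.0, num_images_per_prompt=4)
for i, image in enumerate(sample.images):
    image.save(f"unidiffuser_sample_{i}.png")
```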

## UniDiffuserPipeline
[[autodoc]] UniDiffuserPipeline
	- all
	- __call__

## ImageTextPipelineOutput
[[autodoc]] ImageTextPipelineOutput