Commit 57ac673

a-r-r-o-w, staoxiao, stevhliu, and hlky authored
Refactor OmniGen (#10771)
* OmniGen model.py
* update OmniGenTransformerModel
* omnigen pipeline
* omnigen pipeline
* update omnigen_pipeline
* test case for omnigen
* update omnigenpipeline
* update docs
* update docs
* offload_transformer
* enable_transformer_block_cpu_offload
* update docs
* reformat
* reformat
* reformat
* update docs
* update docs
* make style
* make style
* Update docs/source/en/api/models/omnigen_transformer.md (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/en/using-diffusers/omnigen.md (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/en/using-diffusers/omnigen.md (Co-authored-by: Steven Liu <[email protected]>)
* update docs
* revert changes to examples/
* update OmniGen2DModel
* make style
* update test cases
* Update docs/source/en/api/pipelines/omnigen.md (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/en/using-diffusers/omnigen.md (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/en/using-diffusers/omnigen.md (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/en/using-diffusers/omnigen.md (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/en/using-diffusers/omnigen.md (Co-authored-by: Steven Liu <[email protected]>)
* Update docs/source/en/using-diffusers/omnigen.md (Co-authored-by: Steven Liu <[email protected]>)
* update docs
* typo
* Update src/diffusers/models/embeddings.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/models/attention.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/models/transformers/transformer_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/models/transformers/transformer_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/models/transformers/transformer_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/pipelines/omnigen/pipeline_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/pipelines/omnigen/pipeline_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/pipelines/omnigen/pipeline_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update tests/pipelines/omnigen/test_pipeline_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update tests/pipelines/omnigen/test_pipeline_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/pipelines/omnigen/pipeline_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/pipelines/omnigen/pipeline_omnigen.py (Co-authored-by: hlky <[email protected]>)
* Update src/diffusers/pipelines/omnigen/pipeline_omnigen.py (Co-authored-by: hlky <[email protected]>)
* consistent attention processor
* updata
* update
* check_inputs
* make style
* update testpipeline
* update testpipeline
* refactor omnigen
* more updates
* apply review suggestion

---------

Co-authored-by: shitao <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
Co-authored-by: hlky <[email protected]>
1 parent 81440fd commit 57ac673

File tree

7 files changed: +207 -474 lines changed


docs/source/en/api/models/omnigen_transformer.md (+11)

@@ -14,6 +14,17 @@ specific language governing permissions and limitations under the License.

A Transformer model that accepts multimodal instructions to generate images for [OmniGen](https://github.com/VectorSpaceLab/OmniGen/).

+The abstract from the paper is:
+
+*The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefit from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.*
+
+```python
+import torch
+from diffusers import OmniGenTransformer2DModel
+
+transformer = OmniGenTransformer2DModel.from_pretrained("Shitao/OmniGen-v1-diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
+```
+
## OmniGenTransformer2DModel

[[autodoc]] OmniGenTransformer2DModel
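Loading the transformer on its own is mostly useful for swapping or inspecting that single component. As a hedged sketch (the `transformer=` override is the generic `DiffusionPipeline.from_pretrained` component-override pattern, not something this diff shows), the separately loaded model could then be handed to the pipeline:

```python
import torch
from diffusers import OmniGenPipeline, OmniGenTransformer2DModel

# Load the transformer on its own, then pass it to the pipeline so
# from_pretrained() reuses it instead of loading a second copy.
transformer = OmniGenTransformer2DModel.from_pretrained(
    "Shitao/OmniGen-v1-diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers", transformer=transformer, torch_dtype=torch.bfloat16
)
```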

docs/source/en/api/pipelines/omnigen.md (+7 -33)

@@ -19,27 +19,7 @@

The abstract from the paper is:

-*The emergence of Large Language Models (LLMs) has unified language
-generation tasks and revolutionized human-machine interaction.
-However, in the realm of image generation, a unified model capable of handling various tasks
-within a single framework remains largely unexplored. In
-this work, we introduce OmniGen, a new diffusion model
-for unified image generation. OmniGen is characterized
-by the following features: 1) Unification: OmniGen not
-only demonstrates text-to-image generation capabilities but
-also inherently supports various downstream tasks, such
-as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of
-OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion
-models, it is more user-friendly and can complete complex
-tasks end-to-end through instructions without the need for
-extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefit from
-learning in a unified format, OmniGen effectively transfers
-knowledge across different tasks, manages unseen tasks and
-domains, and exhibits novel capabilities. We also explore
-the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism.
-This work represents the first attempt at a general-purpose image generation model,
-and we will release our resources at https:
-//github.com/VectorSpaceLab/OmniGen to foster future advancements.*
+*The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefit from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.*

<Tip>

@@ -49,25 +29,22 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m

This pipeline was contributed by [staoxiao](https://github.com/staoxiao). The original codebase can be found [here](https://github.com/VectorSpaceLab/OmniGen). The original weights can be found under [hf.co/shitao](https://huggingface.co/Shitao/OmniGen-v1).

-
## Inference

First, load the pipeline:

```python
import torch
from diffusers import OmniGenPipeline
-pipe = OmniGenPipeline.from_pretrained(
-    "Shitao/OmniGen-v1-diffusers",
-    torch_dtype=torch.bfloat16
-)
+
+pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```

For text-to-image, pass a text prompt. By default, OmniGen generates a 1024x1024 image.
You can try setting the `height` and `width` parameters to generate images with different size.

-```py
+```python
prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD."
image = pipe(
    prompt=prompt,
@@ -76,14 +53,14 @@ image = pipe(
    guidance_scale=3,
    generator=torch.Generator(device="cpu").manual_seed(111),
).images[0]
-image
+image.save("output.png")
```

OmniGen supports multimodal inputs.
When the input includes an image, you need to add a placeholder `<img><|image_1|></img>` in the text prompt to represent the image.
It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image.

-```py
+```python
prompt="<img><|image_1|></img> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")]
image = pipe(
@@ -93,14 +70,11 @@ image = pipe(
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(222)).images[0]
-image
+image.save("output.png")
```

-
## OmniGenPipeline

[[autodoc]] OmniGenPipeline
- all
- __call__
-
-
docs/source/en/using-diffusers/omnigen.md (+42 -39)

@@ -19,25 +19,22 @@ For more information, please refer to the [paper](https://arxiv.org/pdf/2409.113
This guide will walk you through using OmniGen for various tasks and use cases.

## Load model checkpoints
+
Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method.

-```py
+```python
import torch
from diffusers import OmniGenPipeline
-pipe = OmniGenPipeline.from_pretrained(
-    "Shitao/OmniGen-v1-diffusers",
-    torch_dtype=torch.bfloat16
-)
-```
-

+pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)
+```

## Text-to-image

For text-to-image, pass a text prompt. By default, OmniGen generates a 1024x1024 image.
You can try setting the `height` and `width` parameters to generate images with different size.

-```py
+```python
import torch
from diffusers import OmniGenPipeline

@@ -55,8 +52,9 @@
    guidance_scale=3,
    generator=torch.Generator(device="cpu").manual_seed(111),
).images[0]
-image
+image.save("output.png")
```
+
<div class="flex justify-center">
  <img src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png" alt="generated image"/>
</div>

@@ -67,7 +65,7 @@ OmniGen supports multimodal inputs.
When the input includes an image, you need to add a placeholder `<img><|image_1|></img>` in the text prompt to represent the image.
It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image.

-```py
+```python
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

@@ -86,9 +84,11 @@
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
-    generator=torch.Generator(device="cpu").manual_seed(222)).images[0]
-image
+    generator=torch.Generator(device="cpu").manual_seed(222)
+).images[0]
+image.save("output.png")
```
+
<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png"/>

@@ -101,7 +101,8 @@ image
</div>

OmniGen has some interesting features, such as visual reasoning, as shown in the example below.
-```py
+
+```python
prompt="If the woman is thirsty, what should she take? Find it in the image and highlight it in blue. <img><|image_1|></img>"
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image = pipe(

@@ -110,20 +111,20 @@
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
-    generator=torch.Generator(device="cpu").manual_seed(0)).images[0]
-image
+    generator=torch.Generator(device="cpu").manual_seed(0)
+).images[0]
+image.save("output.png")
```
+
<div class="flex justify-center">
  <img src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/reasoning.png" alt="generated image"/>
</div>

-
## Controllable generation

-OmniGen can handle several classic computer vision tasks.
-As shown below, OmniGen can detect human skeletons in input images, which can be used as control conditions to generate new images.
+OmniGen can handle several classic computer vision tasks. As shown below, OmniGen can detect human skeletons in input images, which can be used as control conditions to generate new images.

-```py
+```python
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

@@ -142,8 +143,9 @@
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
-    generator=torch.Generator(device="cpu").manual_seed(333)).images[0]
-image1
+    generator=torch.Generator(device="cpu").manual_seed(333)
+).images[0]
+image1.save("image1.png")

prompt="Generate a new photo using the following picture and text as conditions: <img><|image_1|></img>\n A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png")]

@@ -153,8 +155,9 @@
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
-    generator=torch.Generator(device="cpu").manual_seed(333)).images[0]
-image2
+    generator=torch.Generator(device="cpu").manual_seed(333)
+).images[0]
+image2.save("image2.png")
```

<div class="flex flex-row gap-4">

@@ -174,7 +177,8 @@


OmniGen can also directly use relevant information from input images to generate new images.
-```py
+
+```python
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

@@ -193,23 +197,24 @@
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
-    generator=torch.Generator(device="cpu").manual_seed(0)).images[0]
-image
+    generator=torch.Generator(device="cpu").manual_seed(0)
+).images[0]
+image.save("output.png")
```
+
<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/same_pose.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
  </div>
</div>

-
## ID and object preserving

OmniGen can generate multiple images based on the people and objects in the input image and supports inputting multiple images simultaneously.
Additionally, OmniGen can extract desired objects from an image containing multiple objects based on instructions.

-```py
+```python
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

@@ -231,9 +236,11 @@
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
-    generator=torch.Generator(device="cpu").manual_seed(666)).images[0]
-image
+    generator=torch.Generator(device="cpu").manual_seed(666)
+).images[0]
+image.save("output.png")
```
+
<div class="flex flex-row gap-4">
  <div class="flex-1">
    <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.png"/>

@@ -249,7 +256,6 @@
  </div>
</div>

-
```py
import torch
from diffusers import OmniGenPipeline

@@ -261,7 +267,6 @@ pipe = OmniGenPipeline.from_pretrained(
)
pipe.to("cuda")

-
prompt="A woman is walking down the street, wearing a white long-sleeve blouse with lace details on the sleeves, paired with a blue pleated skirt. The woman is <img><|image_1|></img>. The long-sleeve blouse and a pleated skirt are <img><|image_2|></img>."
input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg")
input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg")

@@ -273,8 +278,9 @@
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
-    generator=torch.Generator(device="cpu").manual_seed(666)).images[0]
-image
+    generator=torch.Generator(device="cpu").manual_seed(666)
+).images[0]
+image.save("output.png")
```

<div class="flex flex-row gap-4">

@@ -292,13 +298,12 @@ image
  </div>
</div>

-
-## Optimization when inputting multiple images
+## Optimization when using multiple images

For text-to-image task, OmniGen requires minimal memory and time costs (9GB memory and 31s for a 1024x1024 image on A800 GPU).
However, when using input images, the computational cost increases.

-Here are some guidelines to help you reduce computational costs when inputting multiple images. The experiments are conducted on an A800 GPU with two input images.
+Here are some guidelines to help you reduce computational costs when using multiple images. The experiments are conducted on an A800 GPU with two input images.

Like other pipelines, you can reduce memory usage by offloading the model: `pipe.enable_model_cpu_offload()` or `pipe.enable_sequential_cpu_offload()`.
In OmniGen, you can also decrease computational overhead by reducing the `max_input_image_size`.

@@ -310,5 +315,3 @@ The memory consumption for different image sizes is shown in the table below:
| max_input_image_size=512 | 17GB |
| max_input_image_size=256 | 14GB |

-
-
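To make the optimization advice above concrete, here is a minimal sketch (not part of this commit) that combines model offloading with a smaller `max_input_image_size`; it assumes `max_input_image_size` is accepted as a call-time argument of `OmniGenPipeline`, as the guideline and table above imply:

```python
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)
# Keep only the active submodule on the GPU; idle ones are offloaded to the CPU.
pipe.enable_model_cpu_offload()

prompt = "<img><|image_1|></img> Remove the woman's earrings."
input_images = [load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    max_input_image_size=512,  # assumed call-time knob: smaller inputs mean fewer image tokens and less memory
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(0),
).images[0]
image.save("output.png")
```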

src/diffusers/models/embeddings.py (+1 -1)

@@ -1199,7 +1199,7 @@ def apply_rotary_emb(
            x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)  # [B, S, H, D//2]
            x_rotated = torch.stack([-x_imag, x_real], dim=-1).flatten(3)
        elif use_real_unbind_dim == -2:
-            # Used for Stable Audio
+            # Used for Stable Audio and OmniGen
            x_real, x_imag = x.reshape(*x.shape[:-1], 2, -1).unbind(-2)  # [B, S, H, D//2]
            x_rotated = torch.cat([-x_imag, x_real], dim=-1)
        else:
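For context, the two `use_real_unbind_dim` branches implement the same 90-degree rotation under different feature layouts: interleaved (real, imag) pairs versus two contiguous halves, the latter being the layout used by Stable Audio and, per this commit, OmniGen. A standalone sketch (illustrative only, not code from the commit):

```python
import torch

def rotate_interleaved(x: torch.Tensor) -> torch.Tensor:
    # use_real_unbind_dim == -1: features are interleaved (real, imag) pairs,
    # so each adjacent pair (x_r, x_i) becomes (-x_i, x_r).
    x_real, x_imag = x.reshape(*x.shape[:-1], -1, 2).unbind(-1)
    return torch.stack([-x_imag, x_real], dim=-1).flatten(-2)

def rotate_half_split(x: torch.Tensor) -> torch.Tensor:
    # use_real_unbind_dim == -2 (Stable Audio, OmniGen): the feature dim is
    # two contiguous halves [x1 | x2], rotated to [-x2 | x1].
    x1, x2 = x.reshape(*x.shape[:-1], 2, -1).unbind(-2)
    return torch.cat([-x2, x1], dim=-1)

# RoPE then combines: out = x * cos + rotate(x) * sin
x = torch.arange(8.0).reshape(1, 1, 1, 8)  # [B, S, H, D]
print(rotate_interleaved(x))  # tensor([[[[-1., 0., -3., 2., -5., 4., -7., 6.]]]])
print(rotate_half_split(x))   # tensor([[[[-4., -5., -6., -7., 0., 1., 2., 3.]]]])
```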
