Commit 64dec70

[docs] LoRA support (#10844)

stevhliu and sayakpaul authored

* lora
* update
* update

Co-authored-by: Sayak Paul <[email protected]>

1 parent ffb6777 · commit 64dec70

39 files changed (+156 -0 lines)
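
Every page touched by this commit gains the same four-line badge block, which marks pipelines that support LoRA loading through Diffusers' loader mixins. As a rough sketch of what the badge refers to (not part of this commit; the LoRA repository ID and adapter name below are placeholders), loading an adapter into a tagged pipeline looks like this:

```python
# Sketch of the feature the LoRA badge advertises: pipelines tagged with the
# badge expose `load_lora_weights()`. The base checkpoint is the public SDXL
# model; the LoRA repo ID and adapter name are illustrative placeholders.
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load LoRA weights from the Hub (or a local .safetensors file) under a name
# so the adapter can be enabled, disabled, or mixed with others later.
pipeline.load_lora_weights("your-username/your-sdxl-lora", adapter_name="style")

image = pipeline("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lora_styled.png")
```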

docs/source/en/api/pipelines/animatediff.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Text-to-Video Generation with AnimateDiff
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 ## Overview
 
 [AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai.
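
AnimateDiff is a representative case for the new badge: on top of its motion adapter it can pull in motion LoRAs with the same `load_lora_weights()` call. A minimal sketch, assuming the publicly hosted `guoyww` motion adapter and zoom-out motion LoRA together with an SD 1.5-based checkpoint such as `emilianJR/epiCRealism` (all assumptions, not prescribed by this commit):

```python
# Hedged sketch: AnimateDiff with a motion LoRA loaded via `load_lora_weights()`.
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config,
    beta_schedule="linear",
    clip_sample=False,
    timestep_spacing="linspace",
    steps_offset=1,
)

# The LoRA badge on this page covers calls like this one: a motion LoRA that
# biases the generated motion toward a zoom-out camera move.
pipe.load_lora_weights(
    "guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out"
)
pipe.to("cuda")

output = pipe(
    prompt="a field of sunflowers swaying in the wind, golden hour",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
)
export_to_gif(output.frames[0], "animatediff_zoom_out.gif")
```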

docs/source/en/api/pipelines/cogvideox.md (+4)
@@ -15,6 +15,10 @@
 
 # CogVideoX
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang.
 
 The abstract from the paper is:

docs/source/en/api/pipelines/consisid.md (+4)
@@ -15,6 +15,10 @@
 
 # ConsisID
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) from Peking University & University of Rochester & etc, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan.
 
 The abstract from the paper is:

docs/source/en/api/pipelines/control_flux_inpaint.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # FluxControlInpaint
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 FluxControlInpaintPipeline is an implementation of Inpainting for Flux.1 Depth/Canny models. It is a pipeline that allows you to inpaint images using the Flux.1 Depth/Canny models. The pipeline takes an image and a mask as input and returns the inpainted image.
 
 FLUX.1 Depth and Canny [dev] is a 12 billion parameter rectified flow transformer capable of generating an image based on a text description while following the structure of a given input image. **This is not a ControlNet model**.

docs/source/en/api/pipelines/controlnet.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # ControlNet
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
 
 With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.
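
Since the ControlNet pipelines are also tagged, a short sketch may help show how the badge interacts with the conditioning described above: a style LoRA can be layered on top of the control image. The Canny ControlNet and SD 1.5 checkpoints below are well-known public repos; the LoRA repo ID is a placeholder.

```python
# Hedged sketch: Canny-conditioned generation with a style LoRA stacked on top.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# The LoRA badge applies here too (placeholder repo ID).
pipe.load_lora_weights("your-username/sd15-papercut-lora", adapter_name="papercut")

# Turn an input photo into a Canny edge map to use as the control image.
source = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
edges = cv2.Canny(np.array(source), 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

image = pipe("papercut portrait of a woman", image=control_image).images[0]
image.save("controlnet_lora.png")
```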

docs/source/en/api/pipelines/controlnet_flux.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # ControlNet with Flux.1
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 FluxControlNetPipeline is an implementation of ControlNet for Flux.1.
 
 ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.

docs/source/en/api/pipelines/controlnet_sd3.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # ControlNet with Stable Diffusion 3
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 StableDiffusion3ControlNetPipeline is an implementation of ControlNet for Stable Diffusion 3.
 
 ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.

docs/source/en/api/pipelines/controlnet_sdxl.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # ControlNet with Stable Diffusion XL
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.
 
 With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.

docs/source/en/api/pipelines/controlnet_union.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # ControlNetUnion
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 ControlNetUnionModel is an implementation of ControlNet for Stable Diffusion XL.
 
 The ControlNet model was introduced in [ControlNetPlus](https://github.com/xinsir6/ControlNetPlus) by xinsir6. It supports multiple conditioning inputs without increasing computation.

docs/source/en/api/pipelines/controlnetxs.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # ControlNet-XS
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 ControlNet-XS was introduced in [ControlNet-XS](https://vislearn.github.io/ControlNet-XS/) by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the [original ControlNet](https://huggingface.co/papers/2302.05543) can be made much smaller and still produce good results.
 
 Like the original ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process.

docs/source/en/api/pipelines/deepfloyd_if.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # DeepFloyd IF
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 ## Overview
 
 DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding.

docs/source/en/api/pipelines/flux.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Flux
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 Flux is a series of text-to-image generation models based on diffusion transformers. To know more about Flux, check out the original [blog post](https://blackforestlabs.ai/announcing-black-forest-labs/) by the creators of Flux, Black Forest Labs.
 
 Original model checkpoints for Flux can be found [here](https://huggingface.co/black-forest-labs). Original inference code can be found [here](https://github.com/black-forest-labs/flux).
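
Flux is among the tagged pipelines as well, and Flux LoRAs are often combined, so a sketch of adapter management (loading, weighting, unloading) may be useful. The two LoRA repo IDs below are placeholders; `load_lora_weights`, `set_adapters`, and `unload_lora_weights` come from the standard loader mixins.

```python
# Hedged sketch: managing multiple LoRAs on FluxPipeline. LoRA repo IDs are
# placeholders; FLUX.1-dev is the gated public base checkpoint.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Register two adapters under distinct names.
pipe.load_lora_weights("your-username/flux-watercolor-lora", adapter_name="watercolor")
pipe.load_lora_weights("your-username/flux-lineart-lora", adapter_name="lineart")

# Blend them with per-adapter weights; call again with one name to switch back.
pipe.set_adapters(["watercolor", "lineart"], adapter_weights=[0.8, 0.4])

image = pipe(
    "a fox reading a book under a tree",
    guidance_scale=3.5,
    num_inference_steps=28,
).images[0]
image.save("flux_lora.png")

# Drop the LoRA layers entirely to return to the base model.
pipe.unload_lora_weights()
```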

docs/source/en/api/pipelines/hunyuan_video.md (+4)
@@ -14,6 +14,10 @@
 
 # HunyuanVideo
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [HunyuanVideo](https://www.arxiv.org/abs/2412.03603) by Tencent.
 
 *Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/tencent/HunyuanVideo).*

docs/source/en/api/pipelines/kandinsky3.md (+4)
@@ -9,6 +9,10 @@ specific language governing permissions and limitations under the License.
 
 # Kandinsky 3
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 Kandinsky 3 is created by [Vladimir Arkhipkin](https://github.com/oriBetelgeuse),[Anastasia Maltseva](https://github.com/NastyaMittseva),[Igor Pavlov](https://github.com/boomb0om),[Andrei Filatov](https://github.com/anvilarth),[Arseniy Shakhmatov](https://github.com/cene555),[Andrey Kuznetsov](https://github.com/kuznetsoffandrey),[Denis Dimitrov](https://github.com/denndimitrov), [Zein Shaheen](https://github.com/zeinsh)
 
 The description from it's GitHub page:

docs/source/en/api/pipelines/kolors.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/kolors/kolors_header_collage.png)
 
 Kolors is a large-scale text-to-image generation model based on latent diffusion, developed by [the Kuaishou Kolors team](https://github.com/Kwai-Kolors/Kolors). Trained on billions of text-image pairs, Kolors exhibits significant advantages over both open-source and closed-source models in visual quality, complex semantic accuracy, and text rendering for both Chinese and English characters. Furthermore, Kolors supports both Chinese and English inputs, demonstrating strong performance in understanding and generating Chinese-specific content. For more details, please refer to this [technical report](https://github.com/Kwai-Kolors/Kolors/blob/master/imgs/Kolors_paper.pdf).

docs/source/en/api/pipelines/latent_consistency_models.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Latent Consistency Models
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://huggingface.co/papers/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao.
 
 The abstract of the paper is as follows:

docs/source/en/api/pipelines/ledits_pp.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # LEDITS++
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 LEDITS++ was proposed in [LEDITS++: Limitless Image Editing using Text-to-Image Models](https://huggingface.co/papers/2311.16711) by Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos.
 
 The abstract from the paper is:

docs/source/en/api/pipelines/ltx_video.md (+4)
@@ -14,6 +14,10 @@
 
 # LTX Video
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [LTX Video](https://huggingface.co/Lightricks/LTX-Video) is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. We provide a model for both text-to-video as well as image + text-to-video usecases.
 
 <Tip>

docs/source/en/api/pipelines/lumina2.md (+4)
@@ -14,6 +14,10 @@
 
 # Lumina2
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [Lumina Image 2.0: A Unified and Efficient Image Generative Model](https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0) is a 2 billion parameter flow-based diffusion transformer capable of generating diverse images from text descriptions.
 
 The abstract from the paper is:

docs/source/en/api/pipelines/mochi.md (+4)
@@ -15,6 +15,10 @@
 
 # Mochi 1 Preview
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 > [!TIP]
 > Only a research preview of the model weights is available at the moment.
 

docs/source/en/api/pipelines/pag.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Perturbed-Attention Guidance
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [Perturbed-Attention Guidance (PAG)](https://ku-cvlab.github.io/Perturbed-Attention-Guidance/) is a new diffusion sampling guidance that improves sample quality across both unconditional and conditional settings, achieving this without requiring further training or the integration of external modules.
 
 PAG was introduced in [Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance](https://huggingface.co/papers/2403.17377) by Donghoon Ahn, Hyoungwon Cho, Jaewon Min, Wooseok Jang, Jungwoo Kim, SeonHwa Kim, Hyun Hee Park, Kyong Hwan Jin and Seungryong Kim.

docs/source/en/api/pipelines/panorama.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # MultiDiffusion
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation](https://huggingface.co/papers/2302.08113) is by Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel.
 
 The abstract from the paper is:

docs/source/en/api/pipelines/pia.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Image-to-Video Generation with PIA (Personalized Image Animator)
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 ## Overview
 
 [PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models](https://arxiv.org/abs/2312.13964) by Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing Fang, Kai Chen

docs/source/en/api/pipelines/pix2pix.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # InstructPix2Pix
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://huggingface.co/papers/2211.09800) is by Tim Brooks, Aleksander Holynski and Alexei A. Efros.
 
 The abstract from the paper is:

docs/source/en/api/pipelines/sana.md (+4)
@@ -14,6 +14,10 @@
 
 # SanaPipeline
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 [SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers](https://huggingface.co/papers/2410.10629) from NVIDIA and MIT HAN Lab, by Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, Song Han.
 
 The abstract from the paper is:

docs/source/en/api/pipelines/stable_diffusion/depth2img.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Depth-to-image
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure.
 
 <Tip>

docs/source/en/api/pipelines/stable_diffusion/img2img.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Image-to-image
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 The Stable Diffusion model can also be applied to image-to-image generation by passing a text prompt and an initial image to condition the generation of new images.
 
 The [`StableDiffusionImg2ImgPipeline`] uses the diffusion-denoising mechanism proposed in [SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations](https://huggingface.co/papers/2108.01073) by Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon.
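
The Stable Diffusion image-to-image page gets the same badge; a brief sketch of how a LoRA slots into the SDEdit-style workflow described above (the LoRA repo ID and the input image are placeholders):

```python
# Hedged sketch: img2img with an optional style LoRA.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The LoRA badge on this page refers to support for calls like this one
# (placeholder repo ID).
pipe.load_lora_weights("your-username/sd15-sketch-style-lora")

init_image = load_image("input.png").resize((768, 512))  # any starting image

image = pipe(
    prompt="a fantasy landscape, matte painting",
    image=init_image,
    strength=0.6,        # how far the result may stray from the initial image
    guidance_scale=7.5,
).images[0]
image.save("img2img_lora.png")
```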

docs/source/en/api/pipelines/stable_diffusion/inpaint.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Inpainting
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 The Stable Diffusion model can also be applied to inpainting which lets you edit specific parts of an image by providing a mask and a text prompt using Stable Diffusion.
 
 ## Tips

docs/source/en/api/pipelines/stable_diffusion/ldm3d_diffusion.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Text-to-(RGB, depth)
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 LDM3D was proposed in [LDM3D: Latent Diffusion Model for 3D](https://huggingface.co/papers/2305.10853) by Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, and Vasudev Lal. LDM3D generates an image and a depth map from a given text prompt unlike the existing text-to-image diffusion models such as [Stable Diffusion](./overview) which only generates an image. With almost the same number of parameters, LDM3D achieves to create a latent space that can compress both the RGB images and the depth maps.
 
 Two checkpoints are available for use:

docs/source/en/api/pipelines/stable_diffusion/overview.md (+4)
@@ -12,6 +12,10 @@ specific language governing permissions and limitations under the License.
 
 # Stable Diffusion pipelines
 
+<div class="flex flex-wrap space-x-1">
+<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
+</div>
+
 Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from [CompVis](https://github.com/CompVis), [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). Latent diffusion applies the diffusion process over a lower dimensional latent space to reduce memory and compute complexity. This specific type of diffusion model was proposed in [High-Resolution Image Synthesis with Latent Diffusion Models](https://huggingface.co/papers/2112.10752) by Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer.
 
 Stable Diffusion is trained on 512x512 images from a subset of the LAION-5B dataset. This model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and can run on consumer GPUs.
