add: train to text image with sdxl script. #4505


Merged into main: 39 commits, Aug 16, 2023

Conversation

@sayakpaul (Member) commented Aug 7, 2023

Closes #4366 and builds on top of #4401.

Many thanks to @CaptnSeraph for laying out the foundations here.

To test:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE \
  --dataset_name=$DATASET_NAME \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=15000 \
  --use_8bit_adam \
  --learning_rate=1e-05 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --report_to="wandb" \
  --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
  --checkpointing_steps=5000 \
  --output_dir="sdxl-pokemon-model" \
  --push_to_hub

TODOs

  • Tests
  • Docs
  • Share results

@HuggingFaceDocBuilderDev commented Aug 7, 2023

The documentation is not available anymore as the PR was closed or merged.

torch_dtype=weight_dtype,
)
pipeline = StableDiffusionXLPipeline.from_pretrained(
args.pretrained_model_name_or_path, unet=unet, vae=vae, revision=args.revision, torch_dtype=weight_dtype
Contributor:

can use precomputed embeds here

Member Author:

Explain.

Contributor:

sorry, i pinged the wrong line. when we run validations, we don't need to load the text encoder onto GPU.

@sayakpaul (Member Author), Aug 7, 2023:

We don't support such device placements when initializing and using a pipeline. When we call to() on a pipeline, all the nn.Module components are placed on the same device.

Contributor:

yeah! i just modified my local copy to accept None for text encoders, similar to how the Kandinsky and DeepFloyd pipelines work.
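
As a reference point, here is a minimal sketch of that pattern. It assumes `unet` and `vae` are already in scope from the training loop, and `compute_embeddings` is a hypothetical helper standing in for the script's own embedding code; at the time of this thread, accepting None text encoders required a local patch like the one described above.

import torch
from diffusers import StableDiffusionXLPipeline

# Precompute the prompt embeddings before freeing the text encoders.
embeds, neg_embeds, pooled, neg_pooled = compute_embeddings(
    "a cute Sundar Pichai creature"
)

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    unet=unet,
    vae=vae,
    text_encoder=None,   # never placed on the GPU
    text_encoder_2=None,
    torch_dtype=torch.float16,
).to("cuda")

# __call__ accepts precomputed embeddings instead of a raw prompt.
image = pipeline(
    prompt_embeds=embeds,
    negative_prompt_embeds=neg_embeds,
    pooled_prompt_embeds=pooled,
    negative_pooled_prompt_embeds=neg_pooled,
    num_inference_steps=25,
).images[0]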

@bghira (Contributor) commented Aug 7, 2023

you might want to precompute the latents; they don't take much more room, and using the VAE during tuning really hampers the max batch size. in fact, the text embeds take a lot more disk space than the latents.

Comment on lines +770 to +771
y1, x1, h, w = train_crop.get_params(image, (args.resolution, args.resolution))
image = crop(image, y1, x1, h, w)
Contributor:

nice! that may save some VRAM. but my approach was to go with data bucketing. i don't notice much issue with 1536x1024 training data at batch size 10 with the #4474 fix

crop_top_left = (y1, x1)
crop_top_lefts.append(crop_top_left)
image = train_transforms(image)
all_images.append(image)
Contributor:

ouch, this might run out of memory.

Member Author:

# fingerprint used by the cache for the other processes to load the result
# details: https://github.com/huggingface/diffusers/pull/4038#discussion_r1266078401
new_fingerprint = Hasher.hash(args)
train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint)
Contributor:

if we're not writing these to disk, that's also a lot of memory to consume.

on a 138,000 image dataset i've processed, it uses 44G of system memory for text embeds, and 20GB of memory for the VAE latents. it's really not viable to hold them all in memory for large jobs.
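
Those figures are roughly consistent with a back-of-envelope estimate using standard SDXL shapes (the fp16 assumption below is mine, not a number from this thread):

# SDXL text embeds: 77 tokens x (768 + 1280) = 77 x 2048 features per caption.
bytes_per_text_embed = 77 * 2048 * 2        # fp16 -> ~315 KB per caption
# VAE latents at 1024x1024: 4 channels x 128 x 128.
bytes_per_latent = 4 * 128 * 128 * 2        # fp16 -> ~131 KB per image

n = 138_000
print(n * bytes_per_text_embed / 1e9)       # ~43.5 GB of text embeds
print(n * bytes_per_latent / 1e9)           # ~18 GB of latents (more for multi-aspect)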

Member Author:

In that case, one would serialize those, correct? But I would like to bring your attention to this comment here :-)

#4505 (comment)

@sayakpaul (Member Author):
@bghira thanks for the suggestions. For v1, I would like to keep things as they are. While your suggestions are pretty nice, we, as maintainers of the library, follow our guidelines here: https://huggingface.co/docs/diffusers/main/en/training/overview.

So this means that, in many cases, we will prioritize simplicity over exhaustiveness. This is why we try to keep the scripts simple enough that others can easily customize them to their needs.

@bghira (Contributor) commented Aug 7, 2023

i appreciate simplicity, but it limits the utility of the script to the point of requiring users to make extensive changes. as i had a feeling this would happen, i didn't bother doing the work of writing the PR myself, instead opting to implement all of my suggestions in https://github.com/bghira/SimpleTuner.

@sayakpaul (Member Author):
@bghira let's revisit some of your suggestions here, as in retrospect I think they make a lot of sense :-)

on a 138,000 image dataset i've processed, it uses 44G of system memory for text embeds, and 20GB of memory for the VAE latents. it's really not viable to hold them all in memory for large jobs.

What would you recommend here?

@bghira (Contributor) commented Aug 7, 2023

i have a vae_cache folder in my implementation where i write .pt files for all of the embeds, and the encode function reads from disk if the file is there instead of recomputing it.

@bghira (Contributor) commented Aug 7, 2023

here is a naive implementation that uses multiprocessing (lol, it does not, i was thinking of my data loader) but it even has a progress bar! :D

import hashlib, os, torch, logging
from tqdm import tqdm
from PIL import Image
import torchvision.transforms as transforms

logger = logging.getLogger("VAECache")
logger.setLevel("INFO")


class VAECache:
    def __init__(self, vae, accelerator, cache_dir="vae_cache", resolution: int = 1024):
        self.vae = vae
        self.vae.enable_slicing()
        self.accelerator = accelerator
        self.cache_dir = cache_dir
        self.resolution = resolution
        os.makedirs(self.cache_dir, exist_ok=True)

    def create_hash(self, filename):
        # Create a sha256 hash
        sha256_hash = hashlib.sha256()

        # Feed the hash function with the filename
        sha256_hash.update(filename.encode())

        # Get the hexadecimal representation of the hash
        return sha256_hash.hexdigest()

    def save_to_cache(self, filename, embeddings):
        torch.save(embeddings, filename)

    def load_from_cache(self, filename):
        return torch.load(filename)

    def encode_image(self, pixel_values, filepath: str):
        file_hash = self.create_hash(filepath)
        filename = os.path.join(self.cache_dir, file_hash + ".pt")
        logger.debug(f'Created file_hash {file_hash} from filepath {filepath} for resulting .pt filename.')
        if os.path.exists(filename):
            latents = self.load_from_cache(filename)
            logger.debug(
                f"Loading latents of shape {latents.shape} from existing cache file: {filename}"
            )
        else:
            with torch.no_grad():
                latents = self.vae.encode(
                    # cast to the VAE's own dtype rather than a hard-coded
                    # bfloat16, so the encode works whatever precision the
                    # VAE was loaded in
                    pixel_values.unsqueeze(0).to(
                        self.accelerator.device, dtype=self.vae.dtype
                    )
                ).latent_dist.sample()
                logger.debug(
                    f"Using shape {latents.shape}, creating new latent cache: {filename}"
                )
            latents = latents * self.vae.config.scaling_factor
            logger.debug(f"Latent shape after re-scale: {latents.shape}")
            self.save_to_cache(filename, latents.squeeze())

        output_latents = latents.squeeze().to(
            self.accelerator.device, dtype=self.vae.dtype
        )
        logger.debug(f"Output latents shape: {output_latents.shape}")
        return output_latents

    def process_directory(self, directory):
        # Define a transform to convert the image to tensor
        transform = transforms.ToTensor()

        # Get a list of all the files to process (customize as needed)
        files_to_process = []
        logger.debug(f"Beginning processing of VAECache directory {directory}")
        for subdir, _, files in os.walk(directory):
            for file in files:
                if file.endswith((".png", ".jpg", ".jpeg")):
                    logger.debug(f"Discovered image: {os.path.join(subdir, file)}")
                    files_to_process.append(os.path.join(subdir, file))

        # Iterate through the files, displaying a progress bar
        for filepath in tqdm(files_to_process, desc="Processing images"):
            # Create a hash based on the filename
            file_hash = self.create_hash(filepath)
            filename = os.path.join(self.cache_dir, file_hash + ".pt")

            # If processed file already exists, skip processing for this image
            if os.path.exists(filename):
                logger.debug(
                    f"Skipping processing for {filepath} as cached file {filename} already exists."
                )
                continue

            # Open the image using PIL
            try:
                logger.debug(f"Loading image: {filepath}")
                image = Image.open(filepath)
                image = image.convert("RGB")
                image = self._resize_for_condition_image(image, self.resolution)
            except Exception as e:
                logger.error(f"Encountered error opening image: {e}")
                os.remove(filepath)
                continue

            # Convert the image to a tensor
            try:
                pixel_values = transform(image).to(
                    self.accelerator.device, dtype=self.vae.dtype
                )
            except OSError as e:
                logger.error(f"Encountered error converting image to tensor: {e}")
                continue

            # Process the image with the VAE
            self.encode_image(pixel_values, filepath)

            logger.debug(f"Processed image {filepath}")

    def _resize_for_condition_image(self, input_image: Image.Image, resolution: int):
        input_image = input_image.convert("RGB")
        W, H = input_image.size
        aspect_ratio = round(W / H, 3)
        msg = f"Inspecting image of aspect {aspect_ratio} and size {W}x{H} to "
        if W < H:
            W = resolution
            H = int(resolution / aspect_ratio)  # Calculate the new height
        elif H < W:
            H = resolution
            W = int(resolution * aspect_ratio)  # Calculate the new width
        if W == H:
            W = resolution
            H = resolution
        msg = f"{msg} {W}x{H}."
        logger.debug(msg)
        img = input_image.resize((W, H), resample=Image.BICUBIC)
        return img
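
A hypothetical usage sketch for the class above; the VAE and accelerator would come from the training script's existing setup, and the directory path is illustrative.

from accelerate import Accelerator
from diffusers import AutoencoderKL

accelerator = Accelerator()
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix").to(accelerator.device)

cache = VAECache(vae, accelerator, cache_dir="vae_cache", resolution=1024)
# One-time pass over the dataset; later epochs read the cached .pt files from disk.
cache.process_directory("data/train_images")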

@sayakpaul (Member Author):
Thanks for providing the snippets! If we compute the VAE encodings like this, then it creates a problem during batch preparation, as the images are no longer of uniform shape. I guess we also need to apply a crop (to the training resolution) here, no?

Also, for my own understanding:

  • At this stage of pre-computing, I think we also need to store the original_size and crop_top_lefts information in the .pt files (see the sketch after this list).
  • What about pre-computing the text embeddings too? How would you suggest approaching it?
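
On the first point, a hedged sketch of what such a cache entry could look like; every key name and variable here is made up for illustration.

import torch

# Store everything the SDXL micro-conditioning needs alongside the latents,
# so the add_time_ids can be rebuilt without reopening the original image.
entry = {
    "latents": latents.squeeze().cpu(),
    "original_size": (orig_h, orig_w),
    "crop_top_left": (y1, x1),
    "target_size": (args.resolution, args.resolution),
}
torch.save(entry, cache_path)

# At training time:
entry = torch.load(cache_path)
latents, crop_top_left = entry["latents"], entry["crop_top_left"]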

@sayakpaul (Member Author):
As it is now, with my repository's approach I can do multi-aspect base-1024px images that stretch up to 1.5 megapixel, on a 48G GPU with batch size 4.

Elaborate on one part. When you say multi-aspect 1024px, do you mean multiple aspect ratios while keeping the base resolution at 1024?

@bghira (Contributor) commented Aug 10, 2023

i use the condition resize method that keeps images at 64px step increments, and resizes the smaller edge to 1024px.

i then scale 1024 by the image's aspect ratio to get the other side's length. by doing it this way, an entire batch has the same pixel-perfect sizing.

i'm not doing what StabilityAI did, which is to precompute 1-megapixel resolutions at various aspect ratios and then conform the images to that. my images can go quite large, because the VAE was the true limiting factor there.
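
A minimal sketch of that resize rule as described (the helper name is illustrative): pin the short edge to 1024 and snap the long edge down to a 64px multiple, so every image that lands on the same snapped size can be batched together pixel-perfectly.

from PIL import Image

def resize_to_bucket(img: Image.Image, short_edge: int = 1024, step: int = 64) -> Image.Image:
    w, h = img.size
    scale = short_edge / min(w, h)
    # The short edge becomes exactly `short_edge` (itself a multiple of 64);
    # the long edge is floored to the nearest multiple of `step`.
    new_w = (round(w * scale) // step) * step
    new_h = (round(h * scale) // step) * step
    return img.resize((new_w, new_h), Image.BICUBIC)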

@sayakpaul (Member Author):
Thanks for explaining! Do you have a code reference for me? I would love to understand and visualize this, preferably in a notebook. It could be a valuable resource for the community!

@bghira (Contributor) commented Aug 10, 2023

it has some problems, and the code is more convoluted than i'd like. it can be greatly simplified, but here is what i've been using.

a notable difference between my implementation and others is that i don't do random sampling of images. we process each image once, put it into a 'seen' list, and don't reference it again until the next epoch. i've noted that the Diffusers sampler defaults often tend to over-sample images.
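
A rough sketch of that exhaust-once-per-epoch behavior; this is a simplification for illustration, not SimpleTuner's actual sampler.

import random

class SeenListSampler:
    """Yield every image exactly once per epoch; shuffle order between epochs."""

    def __init__(self, image_paths):
        self.image_paths = list(image_paths)

    def __iter__(self):
        unseen = self.image_paths.copy()
        random.shuffle(unseen)      # order varies, but coverage is guaranteed
        seen = []
        while unseen:
            path = unseen.pop()
            seen.append(path)       # not revisited until the next epoch
            yield path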

@sayakpaul (Member Author):
a notable difference between my implementation and others is that i don't do random sampling of images. we process each image once, put it into a 'seen' list, and don't reference it again until the next epoch. i've noted that the Diffusers sampler defaults often tend to over-sample images.

Elaborate a bit? How much does this impact or degrade quality? Have you experienced it?

Thanks for providing the reference!

@bghira (Contributor) commented Aug 10, 2023

I have experienced it.

oversampling some images relative to others can lead to an uneven distribution of timesteps per image, which results in some of the training data overfitting and the rest underfitting.

this is exacerbated with aspect buckets, because you might have an uneven distribution of images in each bucket. random sampling of aspects then occurs on top of random sampling of images, so some of the training data may never be seen at all.

it depends on how large your dataset is and how much time you can really dedicate to the task. it is made worse when tuning the text encoder alongside the u-net, especially when you're not doing caption dropout. things are more likely to overfit on captions as well as image features.

@bghira (Contributor) commented Aug 10, 2023

a large multi-aspect dataset from LAION might have more than 60% of its images in the 1.0 square aspect.

so by ensuring you can safely sample each bucket sequentially, the chronically underfilled ones get sampled entirely, and the bulk of the remaining training time is spent on the majority of square images. you could also slice the buckets so that they're all evenly filled.

but a colleague ( @kaibioinfo ) mentioned an interesting idea where we could crop images down and train on tiled versions of high-res images with their complete coordinates available as conditioning inputs.

honestly, a lot of my problems with data bucketing could be resolved through clever utilisation of cropping.

@sayakpaul (Member Author):
I think all of this could make an interesting utility repository for different dataloaders for SD training haha. Thanks for sharing your experience and wisdom.

@bghira (Contributor) commented Aug 10, 2023

there's so much to test and so little GPU hours to go around 😂 thanks for being receptive to these changes

@bghira (Contributor) commented Aug 10, 2023

another idea i had was an aspect bucket for each base resolution (256, 512, 768, and 1024), so that we can make the best use of SDXL's conditioning values and open the training data pool up in a massive way.
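
A hedged sketch of what that bucket grid could look like; the step size and maximum aspect ratio below are arbitrary choices for illustration.

def make_buckets(base_resolutions=(256, 512, 768, 1024), step=64, max_aspect=2.0):
    """Enumerate (width, height) buckets at 64px steps for each base resolution."""
    buckets = set()
    for base in base_resolutions:
        long_edge = base
        while long_edge <= int(base * max_aspect):
            buckets.add((base, long_edge))   # portrait
            buckets.add((long_edge, base))   # landscape
            long_edge += step
    return sorted(buckets)

print(make_buckets()[:5])  # smallest buckets: (256, 256), (256, 320), ...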

@sayakpaul sayakpaul marked this pull request as ready for review August 11, 2023 12:10
@yiyixuxu (Collaborator) left a comment:

thanks!

@sayakpaul sayakpaul merged commit 5175d3d into main Aug 16, 2023
@sayakpaul sayakpaul deleted the feat/training-sdxl-text-to-image branch August 16, 2023 03:32
@AmericanPresidentJimmyCarter (Contributor):
I think maybe you should keep the precomputation steps in a community examples section and allow the current training script to use them. Precomputing embeds and latents is something I do to finetune most models I work on, not just SDXL, so a more general solution, a training script that can directly use precomputed embeds/latents, would be useful for me. It could be a pair of CLI options for the training script, --precomputed-text-embeds and --precompute-image-latents, which skip loading the text encoder and VAE and pull them from the dataset instead (a sketch follows below).
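
For what it's worth, a sketch of how those proposed flags might wire in; the flag names come from the comment above, and the loader helpers are hypothetical.

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--precomputed-text-embeds", action="store_true",
                    help="skip loading the text encoders; read prompt embeds from the dataset")
parser.add_argument("--precompute-image-latents", action="store_true",
                    help="skip loading the VAE; read latents from the dataset")
args = parser.parse_args()

# `load_text_encoders` and `load_vae` are hypothetical stand-ins for the
# script's existing model-loading code.
text_encoders = None if args.precomputed_text_embeds else load_text_encoders()
vae = None if args.precompute_image_latents else load_vae()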

@little-misfit:
even if not precomputed, keeping the late

So, you mean for the first epoch we first generate and save them and free the VAE, and for the subsequent epochs we read from the disk?

hi, I want to know whether precomputed VAE embeddings make it impossible to use image data augmentation (because precomputing locks in the image pixel content). thanks :)

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
* add: train to text image with sdxl script.

Co-authored-by: CaptnSeraph <[email protected]>

* fix: partial func.

* fix: default value of output_dir.

* make style

* set num inference steps to 25.

* remove mentions of LoRA.

* up min version

* add: ema cli arg

* run device placement while running step.

* precompute vae encodings too.

* fix

* debug

* should work now.

* debug

* debug

* goes alright?

* style

* debugging

* debugging

* debugging

* debugging

* fix

* reinit scheduler if prediction_type was passed.

* akways cast vae in float32

* better handling of snr.

Co-authored-by: bghira <[email protected]>

* the vae should be also passed

* add: docs.

* add: sdlx t2i tests

* save the pipeline

* autocast.

* fix: save_model_card

* fix: save_model_card.

---------

Co-authored-by: CaptnSeraph <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: bghira <[email protected]>
@cfeng16 commented Sep 29, 2024

thanks for your efforts! just wondering: if i want to finetune SDXL on a large dataset (say 400k images), would precomputing the embeddings and keeping them in memory cause memory issues? how should i deal with that?

Development

Successfully merging this pull request may close these issues.

[SD-XL] Fine-tuning text-to-image script for full U-net
8 participants