
Getting CUDA out of memory error even with Colab A100 high RAM #11014

Open
gurselnaziroglu opened this issue Mar 9, 2025 · 5 comments
Labels
bug Something isn't working

@gurselnaziroglu

Describe the bug

I am trying to fine-tune FLUX.1-dev with LoRA on the Google Colab A100 runtime, which has 80 GB of system RAM and 40 GB of VRAM. I followed the recommended steps from this link, but I am still getting a "CUDA out of memory" error. I saw that related older issues were closed, but the bug still seems to be present.

Reproduction

Here is the last version that I tried. It uses 8-bit AdamW. I also tried the Prodigy optimizer and got the same error.

!accelerate launch -q train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --instance_data_dir="train_photos" \
  --output_dir="trained-flux-lora" \
  --mixed_precision="bf16" \
  --instance_prompt="a photo of X" \
  --resolution=512 \
  --rank=1 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="AdamW" \
  --learning_rate=1. \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of X biking" \
  --validation_epochs=25 \
  --seed="0" \
  --lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0" \
  --use_8bit_adam

Logs

All the weights of FluxTransformer2DModel were initialized from the model checkpoint at black-forest-labs/FLUX.1-dev.
If your task is similar to the task the model of the checkpoint was trained on, you can already use FluxTransformer2DModel for predictions without further training.
03/09/2025 15:45:57 - INFO - __main__ - ***** Running training *****
03/09/2025 15:45:57 - INFO - __main__ -   Num examples = 5
03/09/2025 15:45:57 - INFO - __main__ -   Num batches each epoch = 5
03/09/2025 15:45:57 - INFO - __main__ -   Num Epochs = 250
03/09/2025 15:45:57 - INFO - __main__ -   Instantaneous batch size per device = 1
03/09/2025 15:45:57 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
03/09/2025 15:45:57 - INFO - __main__ -   Gradient Accumulation steps = 4
03/09/2025 15:45:57 - INFO - __main__ -   Total optimization steps = 500
Steps:   0% 0/500 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/train_dreambooth_lora_flux.py", line 1926, in <module>
    main(args)
  File "/content/train_dreambooth_lora_flux.py", line 1720, in main
    model_pred = transformer(
                 ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/diffusers/models/transformers/transformer_flux.py", line 523, in forward
    hidden_states = block(
                    ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/diffusers/models/transformers/transformer_flux.py", line 96, in forward
    hidden_states = torch.cat([attn_output, mlp_hidden_states], dim=2)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 46.88 MiB is free. Process 68220 has 39.50 GiB memory in use. Of the allocated memory 38.84 GiB is allocated by PyTorch, and 172.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Steps:   0% 0/500 [00:01<?, ?it/s]

System Info

Google Colab A100 runtime environment

Who can help?

No response

@gurselnaziroglu gurselnaziroglu added the bug Something isn't working label Mar 9, 2025
@Aravind-11

I think the model is just too big for fine-tuning, and the parameter choices needed to make it fit on the A100 GPU are too tight. It would be better to just use a smaller model.

@a-r-r-o-w
Member

@Aravind-11 It's possible to fully fine-tune Flux in under 24 GB, and to do LoRA fine-tuning in 8 GB or less with clever offloading and other techniques.

@gurselnaziroglu Could you try with --gradient_checkpointing?
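
For context, --gradient_checkpointing enables activation checkpointing on the transformer, so activations are recomputed during the backward pass instead of being kept in VRAM. A minimal sketch of the same mechanism using the diffusers API (model ID taken from the command above; this is illustrative only, not the training script itself):

# Recompute activations in backward instead of storing them: saves VRAM, costs compute.
import torch
from diffusers import FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
transformer.enable_gradient_checkpointing()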

@gurselnaziroglu
Author

> @Aravind-11 It's possible to fully fine-tune Flux in under 24 GB, and to do LoRA fine-tuning in 8 GB or less with clever offloading and other techniques.
>
> @gurselnaziroglu Could you try with --gradient_checkpointing?

Thanks for the suggestion. I tried it, but it didn't work; I'm still getting the same error at the same point. I would appreciate any further suggestions.

@maosuli

maosuli commented Apr 11, 2025

I got the same "torch.OutOfMemoryError: CUDA out of memory" error when loading the Flux transformer onto a 32 GB GPU.

@asomoza
Member

asomoza commented Apr 11, 2025

Hi, I tried your command but with the dog dataset, which is only a few images. With the settings you're using, it requires roughly:

~71 GB RAM
~42.9 GB VRAM

So first, there's no way to load it like this on a 32 GB or 40 GB VRAM GPU; you will always get an OOM, and validation uses even more memory. I tested it on an L40S, which has 45 GB, and it barely fits without validation.

Also, this is not a bug; the README clearly says that you will need more than 40 GB of VRAM. If you want to train Flux like this, you have these options:

  • Cache the embeddings beforehand so you don't need to keep the text encoders and the VAE loaded, and skip validation.
  • Use a bigger GPU.
  • Use quantization (see the sketch after this list).
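
As a rough illustration of the quantization option (not part of the training script; it assumes a recent diffusers with quantization support and the bitsandbytes package installed), the transformer can be loaded with 4-bit NF4 weights to shrink its VRAM footprint before training LoRA adapters on top of it:

# Load the Flux transformer in 4-bit NF4 to reduce its memory footprint.
# Assumes diffusers with BitsAndBytes quantization support and bitsandbytes installed.
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)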

Your best option here is probably a library made specifically for training, which likely has more knobs to lower VRAM usage. If you want to do it with this script, you will need to do some coding to get it running on anything smaller than an A100, H100, or better.
