
Getting CUDA out of memory error even with Colab A100 high RAM #11014

Open
gurselnaziroglu opened this issue Mar 9, 2025 · 5 comments
Labels
bug Something isn't working

@gurselnaziroglu

Describe the bug

I am trying to fine-tune FLUX.1-dev with LoRA on the Google Colab A100 runtime, which has 80 GB of system RAM and 40 GB of VRAM. I followed the recommended steps from this link, but I am still getting a "CUDA out of memory" error. I saw that related older issues were closed, but the bug still seems to be present.

Reproduction

Here is the last version that I tried. It uses 8-bit AdamW. I also tried the Prodigy optimizer and got the same error.

!accelerate launch -q train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-dev" \
  --instance_data_dir="train_photos" \
  --output_dir="trained-flux-lora" \
  --mixed_precision="bf16" \
  --instance_prompt="a photo of X" \
  --resolution=512 \
  --rank=1 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="AdamW" \
  --learning_rate=1. \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="A photo of X biking" \
  --validation_epochs=25 \
  --seed="0" \
  --lora_layers="attn.to_k,attn.to_q,attn.to_v,attn.to_out.0" \
  --use_8bit_adam

Logs

All the weights of FluxTransformer2DModel were initialized from the model checkpoint at black-forest-labs/FLUX.1-dev.
If your task is similar to the task the model of the checkpoint was trained on, you can already use FluxTransformer2DModel for predictions without further training.
03/09/2025 15:45:57 - INFO - __main__ - ***** Running training *****
03/09/2025 15:45:57 - INFO - __main__ -   Num examples = 5
03/09/2025 15:45:57 - INFO - __main__ -   Num batches each epoch = 5
03/09/2025 15:45:57 - INFO - __main__ -   Num Epochs = 250
03/09/2025 15:45:57 - INFO - __main__ -   Instantaneous batch size per device = 1
03/09/2025 15:45:57 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
03/09/2025 15:45:57 - INFO - __main__ -   Gradient Accumulation steps = 4
03/09/2025 15:45:57 - INFO - __main__ -   Total optimization steps = 500
Steps:   0% 0/500 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/train_dreambooth_lora_flux.py", line 1926, in <module>
    main(args)
  File "/content/train_dreambooth_lora_flux.py", line 1720, in main
    model_pred = transformer(
                 ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py", line 819, in forward
    return model_forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/utils/operations.py", line 807, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/diffusers/models/transformers/transformer_flux.py", line 523, in forward
    hidden_states = block(
                    ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/diffusers/models/transformers/transformer_flux.py", line 96, in forward
    hidden_states = torch.cat([attn_output, mlp_hidden_states], dim=2)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 46.88 MiB is free. Process 68220 has 39.50 GiB memory in use. Of the allocated memory 38.84 GiB is allocated by PyTorch, and 172.60 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Steps:   0% 0/500 [00:01<?, ?it/s]

System Info

Google Colab A100 runtime environment

Who can help?

No response

@gurselnaziroglu gurselnaziroglu added the bug Something isn't working label Mar 9, 2025
@Aravind-11

I think the model is just too big for fine-tuning, and the parameter choices needed to make it fit on the A100 GPU are too tight. It would be better to just use a smaller model.

@a-r-r-o-w
Member

@Aravind-11 It's possible to fully fine-tune Flux in under 24 GB, and to do LoRA fine-tuning in 8 GB or less with clever offloading and other techniques.

@gurselnaziroglu Could you try with --gradient_checkpointing?
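
For context, --gradient_checkpointing enables activation checkpointing on the transformer, so activations are recomputed during the backward pass instead of being kept in VRAM. A minimal sketch of the same mechanism using the diffusers API (model ID taken from the command above; this is illustrative only, not the training script itself):

# Recompute activations in backward instead of storing them: saves VRAM, costs compute.
import torch
from diffusers import FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
transformer.enable_gradient_checkpointing()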

@gurselnaziroglu
Author

> @Aravind-11 It's possible to fully fine-tune Flux in under 24 GB, and to do LoRA fine-tuning in 8 GB or less with clever offloading and other techniques.
>
> @gurselnaziroglu Could you try with --gradient_checkpointing?

Thanks for the suggestion. I tried it, but it didn't work; I'm still getting the same error at the same point. I would appreciate any further suggestions.

@maosuli

maosuli commented Apr 11, 2025

I got the same "torch.OutOfMemoryError: CUDA out of memory" error when loading the Flux transformer onto a 32 GB GPU.

@asomoza
Member

asomoza commented Apr 11, 2025

Hi, I tried your command but with the dog dataset, which is only a few images. With the settings you're using, it requires roughly:

~71 GB RAM
~42.9 GB VRAM

So first, there's no way to load it like this on a 32 GB or 40 GB VRAM GPU; you will always get an OOM, and validation uses even more memory. I tested it on an L40S, which has 45 GB, and it barely fits without validation.

Also, this is not a bug; the README clearly says that you will need more than 40 GB of VRAM. If you want to train Flux like this, you have these options:

  • Cache the embeddings beforehand so you don't need to keep the text encoders and the VAE loaded, and skip validation.
  • Use a bigger GPU.
  • Use quantization (see the sketch after this list).
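
As a rough illustration of the quantization option (not part of the training script; it assumes a recent diffusers with quantization support and the bitsandbytes package installed), the transformer can be loaded with 4-bit NF4 weights to shrink its VRAM footprint before training LoRA adapters on top of it:

# Load the Flux transformer in 4-bit NF4 to reduce its memory footprint.
# Assumes diffusers with BitsAndBytes quantization support and bitsandbytes installed.
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)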

Your best option here is probably a library made specifically for training, which likely has more knobs to lower VRAM usage. If you want to do it with this script, you will need to do some coding to get it running on anything smaller than an A100, H100, or better.
