Training memory optimizations not working on AMD hardware #684

Closed
errnoh opened this issue Sep 30, 2022 · 18 comments

errnoh commented Sep 30, 2022

### Describe the bug

The Dreambooth training example has a section about training on a 16GB GPU. Since all Radeon Navi 21 series models have 16GB of VRAM available, this would in theory increase the amount of hardware that can train models by a really large margin.

The problem is that, at least out of the box, neither of the optimizations --gradient_checkpointing nor --use_8bit_adam seems to work on AMD cards.

### Reproduction

Using the example command with PyTorch ROCm 5.1.1 (pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1):

--gradient_checkpointing: returns the error 'UNet2DConditionModel' object has no attribute 'enable_gradient_checkpointing'
--use_8bit_adam: throws a handful of CUDA errors; see the Logs section below for the main part (is bitsandbytes Nvidia-specific, and if so, is there an AMD implementation available?)

### Logs

Using --gradient_checkpointing:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
Traceback (most recent call last):
  File "/home/foobar/diffusers/examples/dreambooth/train_dreambooth.py", line 606, in <module>
    main()
  File "/home/foobar/diffusers/examples/dreambooth/train_dreambooth.py", line 408, in main
    unet.enable_gradient_checkpointing()
  File "/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'UNet2DConditionModel' object has no attribute 'enable_gradient_checkpointing'
Traceback (most recent call last):
  File "/home/foobar/diffusers/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

Using --use_8bit_adam:

...
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
  warn(
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary /home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/bitsandbytes/cextension.py:48: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
...


### System Info

- `diffusers` version: 0.3.0
- Platform: Linux-5.15.67-x86_64-with-glibc2.34
- Python version: 3.9.13
- PyTorch version (GPU?): 1.12.1+rocm5.1.1 (True)
- Huggingface_hub version: 0.9.1
- Transformers version: 4.22.2
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no
errnoh added the bug (Something isn't working) label on Sep 30, 2022
Gonzih commented Oct 2, 2022

I have the same error on Google Colab. It seems UNet2DConditionModel does not expose the enable_gradient_checkpointing method that ModelMixin provides.

Checkpointing seems to have been implemented in the codebase only recently, and I assume it hasn't been released yet.
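
A quick way to verify which situation you're in (a minimal sketch; it checks the class attribute without downloading any weights):

```python
# Minimal check: does the installed diffusers build ship gradient
# checkpointing on the UNet class?
from diffusers import UNet2DConditionModel

print(hasattr(UNet2DConditionModel, "enable_gradient_checkpointing"))
# False on diffusers 0.3.0, True after installing from main
```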

Gonzih commented Oct 2, 2022

The method was added in commit e7120ba (10 days ago), and the latest release was 23 days ago. Installing from main should solve the issue:

pip install git+https://github.com/huggingface/diffusers.git

The same issue and solution were already explained in #656; I'll leave my explanation here just in case.

hopibel commented Oct 2, 2022

The gradient checkpointing error is not related to AMD.

The use_8bit_adam problems potentially are, as bitsandbytes includes a C extension that wraps some CUDA functions directly, i.e. it doesn't go through pytorch-rocm. Not really anything that can be fixed on this end, though.
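
For context, this is roughly the optimizer selection that --use_8bit_adam toggles in the training script (a minimal sketch with placeholder hyperparameters, using a dummy model so it runs standalone):

```python
# Sketch of the optimizer switch behind --use_8bit_adam; importing
# bitsandbytes is what loads its CUDA-backed C extension.
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the UNet

use_8bit_adam = False  # True requires a GPU-enabled bitsandbytes build
if use_8bit_adam:
    import bitsandbytes as bnb
    optimizer_class = bnb.optim.AdamW8bit
else:
    optimizer_class = torch.optim.AdamW

optimizer = optimizer_class(model.parameters(), lr=5e-6)
```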

Gonzih commented Oct 2, 2022

@hopibel thanks for the clarification, I've updated my original comment.

errnoh (Author) commented Oct 3, 2022

@Gonzih thanks! That does indeed solve the gradient checkpointing issue and gets really close to making training work on AMD GPUs. The Dreambooth example actually starts this time, even though it still runs out of memory during the first iteration at 512x512; with this optimization, the maximum resolution that doesn't run out of memory with the example command is 384x384.

That said, there still seems to be some potential to optimize things further on AMD hardware. Is the idea behind use_8bit_adam CUDA-specific, or is it just that there is no implementation for AMD cards? DeepSpeed is also mentioned often and should be supported on ROCm, but based on #599 there hasn't yet been bandwidth to integrate it into diffusers.

Are there any other alternatives that could be tested for lowering memory usage on Radeon cards?
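
For reference, the diffusers-side call that --gradient_checkpointing now triggers looks roughly like this (a minimal sketch; the model id is just an example and may require accepting its license on the Hub):

```python
# Sketch: enabling gradient checkpointing on the UNet directly, which is
# what train_dreambooth.py does when --gradient_checkpointing is passed.
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"  # example model id
)
unet.enable_gradient_checkpointing()  # trades extra compute for activation memory
unet.train()
```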

errnoh (Author) commented Oct 3, 2022

408x408 runs at 1.78 it/s with https://github.com/ShivamShrirao/diffusers/tree/main/examples/dreambooth using `--output_dir=$OUTPUT_DIR --with_prior_preservation --prior_loss_weight=1.0 --train_batch_size=1 --gradient_accumulation_steps=1 --learning_rate=5e-6 --lr_scheduler="constant" --lr_warmup_steps=0 --num_class_images=1500 --max_train_steps=800 --gradient_checkpointing --mixed_precision="fp16" --sample_batch_size=4`.

feffy380 commented Oct 3, 2022

Is the idea behind use_8bit_adam CUDA-specific, or is it just that there is no implementation for AMD cards?

Probably the latter, assuming bitsandbytes isn't doing anything weird. In theory one might be able to just compile a ROCm version with hipcc, but the Python code has some hard-coded checks for CUDA libraries, so it probably isn't that straightforward.
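
To illustrate, the kind of hard-coded check in question looks roughly like this (a hypothetical sketch of the pattern, not the actual bitsandbytes source):

```python
# Hypothetical sketch of a hard-coded CUDA probe; not the actual
# bitsandbytes source.
import ctypes

try:
    ctypes.CDLL("libcudart.so")  # the CUDA runtime; absent on ROCm systems
    backend = "cuda"
except OSError:
    backend = "cpu"  # on ROCm this fallback triggers, disabling 8-bit optimizers

print(backend)
```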

errnoh (Author) commented Oct 4, 2022

Probably the latter, assuming bitsandbytes isn't doing anything weird. In theory one might be able to just compile a ROCm version with hipcc, but the Python code has some hard-coded checks for CUDA libraries, so it probably isn't that straightforward.

There's a fairly recent issue on bitsandbytes about ROCm support: bitsandbytes-foundation/bitsandbytes#47

patrickvonplaten (Contributor) commented

We will release a new diffusers version very soon!

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale (Issues that haven't received updates) label and removed the bug (Something isn't working) label on Oct 30, 2022
github-actions bot closed this as completed on Nov 7, 2022
Jarfeh commented Nov 8, 2022

This still doesn't seem to have been resolved.

patrickvonplaten (Contributor) commented

@patil-suraj could you take a look here?

patil-suraj (Contributor) commented

I'm not really familiar with AMD GPUs, maybe @NouamaneTazi has some ideas here :)

github-actions bot commented Dec 4, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Jarfeh commented Dec 11, 2022

I'm throwing in a comment to try to keep the issue active. I'll eventually be able to test this again, just in case it did get fixed.

patrickvonplaten (Contributor) commented

I think we currently don't have access to AMD hardware for testing, but we'd indeed be quite interested in getting AMD to work. Any pointers on how best to proceed here? Maybe also cc @anton-l @mfuntowicz

Jarfeh commented Dec 14, 2022

Last night I came across this fork of bitsandbytes called bitsandbytes-rocm, and I can confirm that it does in fact work on AMD hardware and at least lets me use Dreambooth; I have not tested any of the other projects in this repo. With an RX 6900 XT, I successfully ran Dreambooth at 13GB of VRAM utilization without xformers.


github-actions bot commented Jan 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
