replace adamW and pagedadam with 8bitpagedadam or torchao CPUOffloadOptimizer #1576
Comments
These are just for our full finetune low-memory configs, right? I almost wonder if we should re-benchmark this recipe with all the new memory optimizations that have been coming in.
@felipemello1 can you please cite the source? I'm deciding between optimizers too.
@NeuralFlux I don't have it :/ But what I heard from some other coworkers is that they didn't observe a change in the loss. Are you doing full finetuning? If so, you need PagedAdam/8bit to save memory. But if you are using LoRA, you don't need PagedAdam, since the optimizer states are not your bottleneck. You can just use AdamW with fused=True.
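To make the two recommendations concrete, here is a minimal sketch, assuming a `model` is already built and a recent bitsandbytes release is installed; the learning rates are placeholders, not torchtune defaults:

```python
import torch
import bitsandbytes as bnb

# LoRA / QLoRA: only the small adapter weights are trained, so gradient and
# optimizer-state memory is tiny; plain fused AdamW is enough.
trainable = [p for p in model.parameters() if p.requires_grad]
lora_optimizer = torch.optim.AdamW(trainable, lr=3e-4, fused=True)

# Full finetuning: optimizer state for every parameter dominates memory, so an
# 8-bit paged optimizer quantizes that state and lets it spill to CPU RAM
# (via CUDA unified memory) instead of hitting an OOM.
full_ft_optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-5)
```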
No worries! I'm doing QLoRA but keep running into OOM because of big sequence lengths. I'm using PagedAdam to save however much memory I can. I noticed we are not compatible with
How big is the sequence length? Also, are you using the torchtune nightlies? If you aren't, please try them.
Then run your model with compile=True and enable_activation_checkpointing=True; you should see a huge difference in memory/tokens per second.
Make sure that your config is using chunked cross entropy for loss=, like we have in our default configs.
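For intuition on why the chunked loss helps: the float32 upcast of the logits is done one chunk at a time instead of materializing the whole [batch * seq, vocab] tensor at once. A rough sketch of the idea (not torchtune's actual loss class):

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(logits: torch.Tensor, labels: torch.Tensor, num_chunks: int = 8) -> torch.Tensor:
    """Cross entropy over [batch, seq, vocab] logits, upcasting one chunk at a time."""
    logits = logits.reshape(-1, logits.size(-1))  # [batch * seq, vocab]
    labels = labels.reshape(-1)                   # [batch * seq]
    loss_sum = torch.zeros((), dtype=torch.float32, device=logits.device)
    num_valid = 0
    for logit_chunk, label_chunk in zip(logits.chunk(num_chunks), labels.chunk(num_chunks)):
        # Only this chunk is upcast to float32, keeping peak memory low.
        loss_sum = loss_sum + F.cross_entropy(
            logit_chunk.float(), label_chunk, reduction="sum", ignore_index=-100
        )
        num_valid += (label_chunk != -100).sum().item()
    return loss_sum / max(num_valid, 1)
```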
Sure! Also, I noticed the QLoRA config mentions the dtype as
I tried installing, but pip still tells me
hmm, maybe try a fresh environment? conda create -n your_env_name python=3.10
Compute is done in bf16. The quantized layers that are not being trained are stored in NF4, which brings them from ~15GiB down to ~5GiB if your model is Llama 8B.
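Rough arithmetic behind those numbers, assuming ~8B parameters; the exact figures depend on NF4 block-scale overhead and on which layers are left unquantized:

```python
params = 8.0e9                    # ~8B parameters

bf16_gib = params * 2 / 2**30     # 2 bytes/param  -> ~14.9 GiB
nf4_gib = params * 0.5 / 2**30    # 4 bits/param   -> ~3.7 GiB before overhead

# Block-wise quantization scales plus the layers that stay in bf16
# (e.g. norms, embeddings, LoRA adapters) push the total to roughly 5 GiB.
print(f"bf16: {bf16_gib:.1f} GiB, NF4 weights only: {nf4_gib:.1f} GiB")
```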
Gotcha, I will try that soon (weekend's about to start here haha). Have a good weekend!
Hi @felipemello1. I launched a job over the weekend that worked. Setting
Do you know the difference between PagedAdEMAMix8bit and bitsandbytes.optim.AdEMAMix8bit?
Hi @FurkanGozukara, I'm not familiar with PagedAdEMAMix8bit; can you share a link?
@FurkanGozukara thanks for sharing the link. Personally I'm not too familiar with these optimizers as we mostly use ones that are available in bitsandbytes. Maybe best to ask on the sd-scripts repo directly?
Thanks. Paged optimizers seem to automatically use the entire VRAM without throwing out-of-VRAM errors; really good :)
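For context, the "paged" bitsandbytes optimizers allocate their state in CUDA unified memory, so the driver can evict it to CPU RAM under memory pressure instead of raising an out-of-memory error. A sketch, assuming a bitsandbytes release that ships the AdEMAMix variants named above (class names taken from this discussion, not verified against a specific version):

```python
import bitsandbytes as bnb

# Non-paged 8-bit variant: optimizer state is quantized to 8 bits but must
# fit entirely in GPU memory.
opt = bnb.optim.AdEMAMix8bit(model.parameters(), lr=1e-4)

# Paged 8-bit variant: same quantized state, but allocated in unified memory
# so the driver can move it to CPU RAM when the GPU is nearly full.
opt_paged = bnb.optim.PagedAdEMAMix8bit(model.parameters(), lr=1e-4)
```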
Going to close this issue as it's more of a question for sd-scripts. @FurkanGozukara feel free to reopen if you need any more assistance on optimizer usage in torchtune |
Apparently there is no reason to use PagedAdam instead of the 8-bit version, so we could replace it.
Also, full finetune single device should use PagedAdam instead of AdamW, for better memory.
For single device, we also have torchao's CPUOffloadOptimizer, which is faster than the one from bitsandbytes: https://github.com/pytorch/ao/blob/8236a874479a9a9168e584c81dda8707f4c41006/torchao/prototype/low_bit_optim/cpu_offload.py#L9
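For reference, a minimal sketch of how the linked torchao prototype is typically constructed; since it lives under torchao.prototype, the import path and keyword arguments may change between versions, and `model` is assumed to be defined elsewhere:

```python
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

# Optimizer state is kept in CPU RAM and the update runs there; updated
# weights are copied back to the GPU after each step.
optimizer = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,       # base optimizer that performs the actual update
    offload_gradients=True,  # optionally keep gradients on CPU as well
    lr=2e-5,
    fused=True,              # remaining kwargs are forwarded to the base optimizer
)
```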