[Community] Help us fix the LR schedulers when num_train_epochs is passed in a distributed training env #8384


Open · 11 of 48 tasks
sayakpaul opened this issue Jun 3, 2024 · 5 comments · Fixed by #11240 · May be fixed by #11239

sayakpaul (Member) commented Jun 3, 2024

Context

Refer to #8312 for the full context. The changes introduced in that PR should be propagated to the following scripts, too (a rough sketch of the fix follows the list):

  • advanced_diffusion_training

    • train_dreambooth_lora_sd15_advanced.py
    • train_dreambooth_lora_sdxl_advanced.py
  • consistency_distillation

    • train_lcm_distill_lora_sdxl.py
  • controlnet

    • train_controlnet.py
    • train_controlnet_sdxl.py
  • custom_diffusion

    • train_custom_diffusion.py
  • dreambooth

    • train_dreambooth.py
    • train_dreambooth_lora.py
    • train_dreambooth_lora_sdxl.py
  • instruct_pix2pix

    • train_instruct_pix2pix.py
    • train_instruct_pix2pix_sdxl.py
  • kandinsky2_2/text_to_image

    • train_text_to_image_decoder.py
    • train_text_to_image_prior.py
    • train_text_to_image_lora_decoder.py
    • train_text_to_image_lora_prior.py
  • t2i_adapter

    • train_t2i_adapter_sdxl.py
  • text_to_image

    • train_text_to_image.py
    • train_text_to_image_sdxl.py
    • train_text_to_image_lora.py
    • train_text_to_image_lora_sdxl.py
  • textual_inversion

    • textual_inversion.py
    • textual_inversion_sdxl.py
  • unconditional_image_generation

    • train_unconditional.py
  • wuerstchen

    • text_to_image/train_text_to_image_prior.py
    • text_to_image/train_text_to_image_lora_prior.py
  • research_projects (low-priority)

    • consistency_training/train_cm_ct_unconditional.py
    • diffusion_dpo/train_diffusion_dpo.py
    • diffusion_dpo/train_diffusion_dpo_sdxl.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora.py
    • dreambooth_inpaint/train_dreambooth_inpaint.py
    • dreambooth_inpaint/train_dreambooth_inpaint_lora.py
    • instructpix2pix_lora/train_instruct_pix2pix_lora.py
    • intel_opts/textual_inversion/textual_inversion_bf16.py
    • intel_opts/textual_inversion_dfq/textual_inversion.py
    • lora/train_text_to_image_lora.py
    • multi_subject_dreambooth/train_multi_subject_dreambooth.py
    • multi_token_textual_inversion/textual_inversion.py
    • onnxruntime/text_to_image/train_text_to_image.py
    • onnxruntime/textual_inversion/textual_inversion.py
    • onnxruntime/unconditional_image_generation/train_unconditional.py
    • realfill/train_realfill.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora.py
    • scheduled_huber_loss_training/dreambooth/train_dreambooth_lora_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_sdxl.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora.py
    • scheduled_huber_loss_training/text_to_image/train_text_to_image_lora_sdxl.py
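
For reference, the core of the change introduced in #8312 looks roughly like the sketch below. This is a minimal, illustrative version that assumes the variable names commonly used in the diffusers training scripts (`args`, `optimizer`, `train_dataloader`, `accelerator`); check the diff of #8312 for the exact code to port.

```python
import math

from diffusers.optimization import get_scheduler


def build_lr_scheduler(args, optimizer, train_dataloader, accelerator):
    """Create the LR scheduler with a step count that matches distributed training.

    The scheduler is created *before* accelerator.prepare() shards the dataloader,
    so when --max_train_steps is derived from --num_train_epochs we first estimate
    the per-process dataloader length after sharding. The scheduler's warmup and
    total step counts are then scaled by accelerator.num_processes, because a
    scheduler prepared with accelerate is stepped num_processes times per update.
    """
    num_warmup_steps_for_scheduler = args.lr_warmup_steps * accelerator.num_processes
    if args.max_train_steps is None:
        # Estimate how long the dataloader will be on each process after sharding.
        len_train_dataloader_after_sharding = math.ceil(
            len(train_dataloader) / accelerator.num_processes
        )
        num_update_steps_per_epoch = math.ceil(
            len_train_dataloader_after_sharding / args.gradient_accumulation_steps
        )
        num_training_steps_for_scheduler = (
            args.num_train_epochs * num_update_steps_per_epoch * accelerator.num_processes
        )
    else:
        num_training_steps_for_scheduler = args.max_train_steps * accelerator.num_processes

    return get_scheduler(
        args.lr_scheduler,
        optimizer=optimizer,
        num_warmup_steps=num_warmup_steps_for_scheduler,
        num_training_steps=num_training_steps_for_scheduler,
    )
```

The PR also recomputes `num_update_steps_per_epoch` from the sharded dataloader after `accelerator.prepare()` and logs a warning if the scheduler's step count no longer matches `args.max_train_steps * accelerator.num_processes`, so carry that consistency check over as well.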

The following scripts do not have the argument --num_train_epochs:

  • amused
    • train_amused.py
  • research_projects
    • multi_subject_dreambooth_inpainting/train_multi_subject_dreambooth_inpainting.py

So, they don't need to be updated.

Then we have the following scripts that don't use accelerate to prepare the datasets. Distributed dataset sharding in them is handled by WebDataset, not by the accelerator, so we can skip them for now:

  • consistency_distillation
    • train_lcm_distill_sd_wds.py
    • train_lcm_distill_sdxl_wds.py
    • train_lcm_distill_lora_sd_wds.py
    • train_lcm_distill_lora_sdxl_wds.py
  • research_projects
    • controlnet/train_controlnet_webdataset.py
    • diffusion_orpo/train_diffusion_orpo_sdxl_lora_wds.py

Steps to follow when opening PRs

  • Target one AND only one training script in a single PR.
  • When you open a PR, please mention this issue.
  • Mention @sayakpaul and @geniuspatrick for a review.
  • Accompany your PR with a minimal training command that uses the num_train_epochs CLI arg (see the example command after this list).
  • Enjoy!
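
For example, a minimal command might look like the following (the script, model, and dataset here are placeholders; substitute the ones relevant to your PR):

```bash
accelerate launch --num_processes=2 train_text_to_image.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/naruto-blip-captions" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=2 \
  --output_dir="num-train-epochs-test"
```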
a-r-r-o-w (Member) commented

@sayakpaul Are all of the above resolved? Maybe we can complete the remaining ones ourselves; if so, do you know which ones are still open?

sayakpaul (Member, Author) commented

I think we can keep this open, as it's a good chance for the community to contribute.

kghamilton89 (Contributor) commented

@sayakpaul is this issue still looking for contributions?

sayakpaul (Member, Author) commented

@kghamilton89 I just updated the list to mark the scripts that have already been fixed. So, yes, this is very much open for contributions.

kghamilton89 (Contributor) commented

@sayakpaul, the above PRs close out the DreamBooth trainers.
